[ https://issues.apache.org/jira/browse/IMPALA-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Becker updated IMPALA-7471:
----------------------------------
    Target Version: Impala 4.3.0  (was: Impala 4.2.0)

> Impala crashes or returns incorrect results when querying parquet nested types
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-7471
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7471
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Csaba Ringhofer
>            Priority: Critical
>              Labels: complextype, correctness, crash, parquet
>         Attachments: test_users_131786401297925138_0.parquet
>
>
> From
> http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779
> {quote}We found a case where Impala returns incorrect values from simple
> query. Our data contains nested array of structures and structures contains
> other structures.
> We generated minimal sample data allowing to reproduce the issue.
>
> SQL to create a table:
> {quote}
> {code}
> CREATE TABLE plat_test.test_users (
>   id INT,
>   name STRING,
>   devices ARRAY<
>     STRUCT<
>       id:STRING,
>       device_info:STRUCT<
>         model:STRING
>       >
>     >
>   >
> )
> STORED AS PARQUET
> {code}
> {quote}
> Please put attached parquet file to the location of the table and refresh the
> table.
> In sample data we have 2 users, one with 2 devices, second one with 3. Some
> of the devices.device_info.model fields are NULL.
>
> When I issue a query:
> {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u,
> u.devices d;
> {code}
> {quote}
> I'm expecting to get 5 records in results, but getting only one
> 1.png
> If I change query to:
> {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u
> LEFT OUTER JOIN u.devices d;
> {code}
> {quote}
> I'm getting two records in the results, but still not as it should be.
> We found some workaround to this problem.
> If we add to the result columns device.id we will get all records from
> parquet file:
> {quote}
> {code}
> SELECT u.name, d.id, d.device_info.model as model
> FROM test_users u
> , u.devices d
> {code}
> {quote}
> And result is 3.png
>
> But we can't rely on this workaround, because we don't need device.id in all
> queries and Impala optimizes it, and as a result we are getting unpredicted
> results.
>
> I tested Hive query on this table and it returns expected results:
> {quote}
> {code}
> SELECT u.name, d.device_info.model
> FROM test_users u
> lateral view outer inline (u.devices) d;
> {code}
> {quote}
> results:
> 4.png
> Please advice if it's a problem in Impala engine or we did some mistake in
> our query.
>
> Best regards,
> Come2Play team.
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
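
As a sanity check on the row counts the report expects, the join semantics can be modelled outside Impala. The sketch below is only an illustration in Python: the `users` data is a hypothetical stand-in for the attached Parquet file (the real names, ids and models are not shown in the report), and `inner_unnest` / `left_outer_unnest` are made-up helper names, not Impala functions. It assumes only what the report states: two users, with 2 and 3 devices, and some `device_info.model` values NULL.

```python
# Hypothetical stand-in for the attached Parquet file. The report only states
# the shape (2 users, with 2 and 3 devices, some models NULL); every concrete
# value below is a placeholder.
users = [
    {"id": 1, "name": "user_a", "devices": [
        {"id": "d1", "device_info": {"model": "m1"}},
        {"id": "d2", "device_info": {"model": None}},
    ]},
    {"id": 2, "name": "user_b", "devices": [
        {"id": "d3", "device_info": {"model": None}},
        {"id": "d4", "device_info": {"model": "m4"}},
        {"id": "d5", "device_info": {"model": None}},
    ]},
]


def inner_unnest(rows):
    """Model of `FROM test_users u, u.devices d`: one row per array element."""
    return [(u["name"], d["device_info"]["model"])
            for u in rows
            for d in u["devices"]]


def left_outer_unnest(rows):
    """Model of `LEFT OUTER JOIN u.devices d`: a user with an empty array
    still produces one row, with NULL for the device columns."""
    out = []
    for u in rows:
        devices = u["devices"] or [None]
        for d in devices:
            model = d["device_info"]["model"] if d is not None else None
            out.append((u["name"], model))
    return out


inner_rows = inner_unnest(users)
outer_rows = left_outer_unnest(users)
print(len(inner_rows), len(outer_rows))
```

Under these assumptions both forms yield one row per device, so a NULL `model` should appear as a NULL in the output row rather than causing the row to be dropped. Because neither user has an empty `devices` array, the inner and left outer variants should return the same rows.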