[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Igor Guzenko resolved DRILL-5183.
---------------------------------
    Resolution: Fixed

Fixed in DRILL-7268.

> Drill doesn't seem to handle array values correctly in Parquet files
> --------------------------------------------------------------------
>
>                 Key: DRILL-5183
>                 URL: https://issues.apache.org/jira/browse/DRILL-5183
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Dave Kincaid
>            Assignee: Igor Guzenko
>            Priority: Major
>         Attachments: books.parquet
>
>
> It looks to me that Drill is not properly converting array values in Parquet
> records. I have created a simple example and will attach a simple Parquet
> file to this issue. If I write Parquet records using the Avro schema
> {code:title=Book.avsc}
> { "type": "record",
>   "name": "Book",
>   "fields": [
>     { "name": "title", "type": "string" },
>     { "name": "pages", "type": "int" },
>     { "name": "authors", "type": {"type": "array", "items": "string"} }
>   ]
> }
> {code}
> I write two records using this schema into the attached Parquet file and then
> simply run {{SELECT * FROM dfs.`books.parquet`}} I get the following result:
> ||title||pages||authors||
> |Physics of Waves|477|{"array":["William C. Elmore","Mark A. Heald"]}|
> |Foundations of Mathematical Analysis|428|{"array":["Richard Johnsonbaugh","W.E. Pfaffenberger"]}|
> You can see that the authors column seems to be a nested record with the name
> "array" instead of being a repeated value. If I change the SQL query to
> {{SELECT title,pages,t.authors.`array` FROM
> dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t;}} then
> I get:
> ||title||pages||EXPR$2||
> |Physics of Waves|477|["William C. Elmore","Mark A. Heald"]|
> |Foundations of Mathematical Analysis|428|["Richard Johnsonbaugh","W.E. Pfaffenberger"]|
> and now that column behaves in Drill as a repeated values column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
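
The two row shapes described in the report can be illustrated with a minimal Python sketch. This is not Drill code and says nothing about how DRILL-7268 fixed the issue. It assumes each row is a plain dict shaped like the first query result, with the list wrapped in a single-field record named "array". The helper name unwrap_avro_list is hypothetical. Unwrapping the record by hand gives the same shape as the second query's projection of the "array" field.

```python
# Illustrative only: mimics the row shapes shown in the report, not Drill internals.

def unwrap_avro_list(value, wrapper_key="array"):
    """Return the inner list if value is a single-field wrapper record, else value unchanged."""
    if (
        isinstance(value, dict)
        and set(value) == {wrapper_key}
        and isinstance(value[wrapper_key], list)
    ):
        return value[wrapper_key]
    return value


# Row as returned by SELECT * in the report: authors is a record named "array".
row = {
    "title": "Physics of Waves",
    "pages": 477,
    "authors": {"array": ["William C. Elmore", "Mark A. Heald"]},
}

# Unwrap every column; scalar columns pass through untouched.
flat = {name: unwrap_avro_list(value) for name, value in row.items()}
print(flat["authors"])
```

The helper only unwraps a record whose sole field is the wrapper key, so a genuine nested record that happens to contain an "array" field alongside others is left alone.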