[jira] [Commented] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680492#comment-14680492 ]

Damian Guy commented on SPARK-9340:
-----------------------------------

Code looks good and it works as expected. Tests pass. Thanks for your assistance with this.

CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly
---------------------------------------------------------------------------------------------------

Key: SPARK-9340
URL: https://issues.apache.org/jira/browse/SPARK-9340
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
Reporter: Damian Guy
Assignee: Cheng Lian
Attachments: ParquetTypesConverterTest.scala

SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement the backwards-compatibility rules defined in the {{parquet-format}} spec. However, both Spark SQL and {{parquet-avro}} neglected the following statement in {{parquet-format}}:
{quote}
This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of required elements where the element type is the type of the field.
{quote}
One consequence is that Parquet files generated by {{parquet-protobuf}} containing unannotated repeated fields are not correctly converted to Catalyst arrays. For example, the following Parquet schema
{noformat}
message root {
  repeated int32 f1
}
{noformat}
should be converted to
{noformat}
StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), nullable = false) :: Nil)
{noformat}
but currently it triggers an {{AnalysisException}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
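The conversion rule quoted above can be sketched in plain Scala. This is a minimal, self-contained model of the rule, not Spark's actual CatalystSchemaConverter; the type names (StructType, ArrayType, etc.) only mirror Catalyst's for readability.

```scala
// Toy Catalyst-style types -- these mimic Spark SQL's names but are
// defined locally so the sketch is self-contained.
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean)
case class StructType(fields: List[StructField])

// Parquet repetition levels.
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

// A Parquet primitive field: name, primitive type, repetition.
case class PrimitiveField(name: String, primitive: DataType, repetition: Repetition)

object SchemaRule {
  def convertField(f: PrimitiveField): StructField = f.repetition match {
    // The parquet-format rule: an unannotated repeated field is a required
    // list of required elements whose element type is the field's type.
    case Repeated => StructField(f.name, ArrayType(f.primitive, containsNull = false), nullable = false)
    case Optional => StructField(f.name, f.primitive, nullable = true)
    case Required => StructField(f.name, f.primitive, nullable = false)
  }

  def main(args: Array[String]): Unit = {
    // "message root { repeated int32 f1 }" converts as described above.
    println(StructType(List(convertField(PrimitiveField("f1", IntegerType, Repeated)))))
  }
}
```

Under this rule the repeated field never becomes a nullable scalar, which is exactly the mismatch the original report describes.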
[jira] [Commented] (SPARK-9340) ParquetTypeConverter's incorrect handling of repeated types results in a schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680281#comment-14680281 ]

Damian Guy commented on SPARK-9340:
-----------------------------------

Thanks. I'm sure there is a simpler solution for someone more familiar with the code! ;-) Thanks for looking further into it, appreciated.

Key: SPARK-9340
URL: https://issues.apache.org/jira/browse/SPARK-9340
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
Reporter: Damian Guy
Attachments: ParquetTypesConverterTest.scala

The way ParquetTypesConverter handles primitive repeated types results in an incompatible schema being used for querying data. For example, given a schema like:
{noformat}
message root {
  repeated int32 repeated_field;
}
{noformat}
Spark produces a read schema like:
{noformat}
message root {
  optional int32 repeated_field;
}
{noformat}
These are incompatible and all attempts to read fail.

In ParquetTypesConverter.toDataType:
{noformat}
if (parquetType.isPrimitive) {
  toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, isInt96AsTimestamp)
} else {...}
{noformat}
The if condition should also include !parquetType.isRepetition(Repetition.REPEATED), and that case will then need to be handled in the else branch.
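The shape of the suggested fix can be illustrated with a toy branch structure. The field names (isPrimitive, isRepeated) echo the parquet-mr API referred to above, but this is a self-contained sketch, not the real ParquetTypesConverter.toDataType.

```scala
// Local stand-ins for the Spark SQL types involved.
sealed trait SparkType
case object IntType extends SparkType
case class ArrayOf(elem: SparkType) extends SparkType

// A simplified view of a Parquet type: is it primitive, and is it repeated?
case class ParquetField(isPrimitive: Boolean, isRepeated: Boolean)

object ToDataTypeSketch {
  def toDataType(t: ParquetField): SparkType =
    if (t.isPrimitive && !t.isRepeated) {
      // Plain primitive: convert directly, as the existing code already does.
      // (Only int32 is modeled here to keep the sketch small.)
      IntType
    } else if (t.isPrimitive && t.isRepeated) {
      // The previously unhandled case: an unannotated repeated primitive
      // becomes an array of the primitive type instead of a nullable scalar.
      ArrayOf(IntType)
    } else {
      // Group types (LIST/MAP/struct) are out of scope for this sketch.
      sys.error("group types not modeled in this sketch")
    }

  def main(args: Array[String]): Unit = {
    println(toDataType(ParquetField(isPrimitive = true, isRepeated = true)))
  }
}
```

The point of the guard is that "primitive" and "scalar" are no longer treated as synonyms: a repeated primitive is routed to the list-conversion path rather than falling through to the scalar one.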
[jira] [Commented] (SPARK-9340) ParquetTypeConverter's incorrect handling of repeated types results in a schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679792#comment-14679792 ]

Damian Guy commented on SPARK-9340:
-----------------------------------

Hi, I did try it against 1.5 and the problem still exists - hence the fix. You can verify this yourself by running just the tests I added, without making the other changes. The part of the Parquet spec that matters in this case is here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types In particular:
{quote}
This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field.
{quote}
parquet-protobuf does a 1-to-1 mapping and does not use annotations; it is compliant with the spec. Whilst I feel the spec should be tighter and the schema should be consistent no matter the original data format, this is not the case.
[jira] [Updated] (SPARK-9340) ParquetTypeConverter's incorrect handling of repeated types results in a schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Damian Guy updated SPARK-9340:
------------------------------
Affects Version/s: 1.5.0
[jira] [Updated] (SPARK-9340) ParquetTypeConverter's incorrect handling of repeated types results in a schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Damian Guy updated SPARK-9340:
------------------------------
Affects Version/s: 1.3.0
[jira] [Commented] (SPARK-9340) ParquetTypeConverter's incorrect handling of repeated types results in a schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661831#comment-14661831 ]

Damian Guy commented on SPARK-9340:
-----------------------------------

I created a pull request against the 1.3 branch (closest to what I am using): https://github.com/apache/spark/pull/8032
[jira] [Commented] (SPARK-9340) ParquetTypeConverter's incorrect handling of repeated types results in a schema mismatch
[ https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642035#comment-14642035 ]

Damian Guy commented on SPARK-9340:
-----------------------------------

We have protobuf/parquet files (parquet 1.4.3) generated from M/R jobs that represent lists like this:
{noformat}
repeated int32 repeated_field;
{noformat}
With Spark 1.2 and 1.4 we are unable to read these fields, as the schema Spark produces, i.e. optional int32 repeated_field, is not correct for these files. It looks to me like Spark reads the schema from the file, converts it to its internal representation (Attributes), and then converts the attributes into the schema that is used when trying to read. It then fails due to a schema mismatch.