[jira] [Commented] (SPARK-9340) CatalystSchemaConverter and CatalystRowConverter don't handle unannotated repeated fields correctly

2015-08-10 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680492#comment-14680492
 ] 

Damian Guy commented on SPARK-9340:
---

Code looks good and it works as expected. Tests pass. Thanks for your 
assistance with this.

 CatalystSchemaConverter and CatalystRowConverter don't handle unannotated 
 repeated fields correctly
 ---

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
Reporter: Damian Guy
Assignee: Cheng Lian
 Attachments: ParquetTypesConverterTest.scala


 SPARK-6776 and SPARK-6777 followed {{parquet-avro}} to implement the 
 backwards-compatibility rules defined in the {{parquet-format}} spec. However, 
 both Spark SQL and {{parquet-avro}} neglected the following statement in 
 {{parquet-format}}:
 {quote}
 This does not affect repeated fields that are not annotated: A repeated field 
 that is neither contained by a {{LIST}}- or {{MAP}}-annotated group nor 
 annotated by {{LIST}} or {{MAP}} should be interpreted as a required list of 
 required elements where the element type is the type of the field.
 {quote}
 One of the consequences is that Parquet files generated by 
 {{parquet-protobuf}} containing unannotated repeated fields are not correctly 
 converted to Catalyst arrays.
 For example, the following Parquet schema
 {noformat}
 message root {
   repeated int32 f1;
 }
 {noformat}
 should be converted to
 {noformat}
 StructType(StructField(f1, ArrayType(IntegerType, containsNull = false), 
 nullable = false) :: Nil)
 {noformat}
 But now it triggers an {{AnalysisException}}.
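
To make the quoted rule concrete, here is a minimal sketch in plain Scala. It uses small stand-in types rather than the real Parquet or Catalyst classes, so every name below is an assumption for illustration only. It covers just a top-level int32 field that is neither annotated nor nested inside a LIST or MAP group.

```scala
object UnannotatedRepeatedSketch {
  // Stand-ins for Parquet repetition levels (hypothetical, not parquet-mr classes).
  sealed trait Repetition
  case object Required extends Repetition
  case object Optional extends Repetition
  case object Repeated extends Repetition

  // A primitive Parquet field, restricted to int32 to keep the sketch short.
  final case class Int32Field(name: String, repetition: Repetition)

  // Stand-ins for the Catalyst types named in the issue description.
  sealed trait DataType
  case object IntegerType extends DataType
  final case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
  final case class StructField(name: String, dataType: DataType, nullable: Boolean)

  // Converts a field that is NOT inside a LIST- or MAP-annotated group.
  def toStructField(field: Int32Field): StructField = field.repetition match {
    // Unannotated repeated field: a required list of required elements.
    case Repeated =>
      StructField(field.name, ArrayType(IntegerType, containsNull = false), nullable = false)
    case Optional =>
      StructField(field.name, IntegerType, nullable = true)
    case Required =>
      StructField(field.name, IntegerType, nullable = false)
  }
}
```

Under these assumptions, `toStructField(Int32Field("f1", Repeated))` yields a non-nullable array field whose elements are also non-nullable, matching the expected conversion in the example above.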



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-10 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680281#comment-14680281
 ] 

Damian Guy commented on SPARK-9340:
---

Thanks. I'm sure there is a simpler solution for someone more familiar with the 
code! ;-) Thanks for looking further into it, appreciated.

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala


 The way ParquetTypesConverter handles primitive repeated types results in an 
 incompatible schema being used for querying data. For example, given a schema 
 like so:
   message root {
     repeated int32 repeated_field;
   }
 Spark produces a read schema like:
   message root {
     optional int32 repeated_field;
   }
 These are incompatible and all attempts to read fail.
 In ParquetTypesConverter.toDataType:
   if (parquetType.isPrimitive) {
     toPrimitiveDataType(parquetType.asPrimitiveType, isBinaryAsString, 
       isInt96AsTimestamp)
   } else {...}
 The if condition should also check 
   !parquetType.isRepetition(Repetition.REPEATED)
 
 so that repeated primitives fall through to the else branch, where this case 
 will then need to be handled.
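
The effect of the suggested condition can be sketched in self-contained Scala. The types below are hypothetical stand-ins, not the Parquet or Spark classes, and the handling of the repeated case (mapping it to an array of non-null elements) is one possible choice assumed here, not necessarily what the eventual patch does.

```scala
object ToDataTypeGuardSketch {
  sealed trait Repetition
  case object Required extends Repetition
  case object Optional extends Repetition
  case object Repeated extends Repetition

  // Hypothetical stand-in for a Parquet type; it exposes only what the guard needs.
  final case class ParquetField(name: String, isPrimitive: Boolean, repetition: Repetition) {
    def isRepetition(r: Repetition): Boolean = repetition == r
  }

  sealed trait DataType
  case object IntegerType extends DataType
  final case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

  // Behaviour as described in the report: repetition is ignored for primitives.
  def toDataTypeUnguarded(t: ParquetField): DataType =
    if (t.isPrimitive) IntegerType
    else sys.error("group types are omitted in this sketch")

  // With the extra condition, a repeated primitive no longer takes the scalar path.
  def toDataTypeGuarded(t: ParquetField): DataType =
    if (t.isPrimitive && !t.isRepetition(Repeated)) IntegerType
    else if (t.isPrimitive) ArrayType(IntegerType, containsNull = false) // assumed handling
    else sys.error("group types are omitted in this sketch")
}
```

For a field such as `ParquetField("repeated_field", isPrimitive = true, Repeated)`, the unguarded version returns a scalar type and loses the repetition, while the guarded version returns an array type.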






[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-10 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679792#comment-14679792
 ] 

Damian Guy commented on SPARK-9340:
---

Hi, I did try it against 1.5 and the problem still exists, hence the fix. 
You can try it for yourself if you run just the tests I added without making 
the other changes.

The part of the parquet spec that matters in this case is here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types
In particular:
This does not affect repeated fields that are not annotated: A repeated field 
that is neither contained by a LIST- or MAP-annotated group nor annotated by 
LIST or MAP should be interpreted as a required list of required elements where 
the element type is the type of the field.

parquet-protobuf does a 1-to-1 mapping and does not have annotations. It is 
compliant with the spec. 
Whilst I feel the spec should be tighter and the schema should be consistent no 
matter the original data format, this is not the case. 

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala








[jira] [Updated] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-09 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated SPARK-9340:
--
Affects Version/s: 1.5.0

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala








[jira] [Updated] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Damian Guy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Guy updated SPARK-9340:
--
Affects Version/s: 1.3.0

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala








[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-08-07 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14661831#comment-14661831
 ] 

Damian Guy commented on SPARK-9340:
---

I created a pull request against the 1.3 branch (closest to what I am using):
https://github.com/apache/spark/pull/8032

 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala








[jira] [Commented] (SPARK-9340) ParquetTypeConverter incorrectly handling of repeated types results in schema mismatch

2015-07-26 Thread Damian Guy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642035#comment-14642035
 ] 

Damian Guy commented on SPARK-9340:
---

We have protobuf/parquet files (parquet 1.4.3) generated from M/R jobs that 
represent lists like this:
repeated int32 repeated_field;

With Spark 1.2 and 1.4 we are unable to read these fields, as the schema Spark 
produces, i.e. optional int32 repeated_field, is not correct for these files. It 
looks to me that Spark reads the schema from the file, converts it to its 
internal representation, Attributes, and then converts the attributes into the 
schema that is used when trying to read. It then fails due to a schema 
mismatch.
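
The round trip described above can be modelled in a few lines of Scala. This is only a sketch of the suspected behaviour: the `Attribute` class and both conversion functions are simplified, hypothetical stand-ins, and the assumption that any non-required field becomes a nullable attribute is inferred from the observed `optional int32` output rather than taken from the Spark source.

```scala
object SchemaRoundTripSketch {
  sealed trait Repetition
  case object Required extends Repetition
  case object Optional extends Repetition
  case object Repeated extends Repetition

  // A primitive int32 field as it appears in a Parquet schema.
  final case class Int32Field(name: String, repetition: Repetition)

  // Simplified internal representation: only the name and nullability survive.
  final case class Attribute(name: String, nullable: Boolean)

  // File schema -> attribute. Assumed: anything not required is treated as nullable.
  def toAttribute(f: Int32Field): Attribute =
    Attribute(f.name, nullable = f.repetition != Required)

  // Attribute -> read schema. The repetition was lost, so "repeated" cannot be restored.
  def toReadField(a: Attribute): Int32Field =
    Int32Field(a.name, if (a.nullable) Optional else Required)

  def roundTrip(f: Int32Field): Int32Field = toReadField(toAttribute(f))
}
```

In this model, `roundTrip(Int32Field("repeated_field", Repeated))` comes back as an optional field, which no longer equals the field in the file schema. That is the kind of mismatch the read then fails on.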


 ParquetTypeConverter incorrectly handling of repeated types results in schema 
 mismatch
 --

 Key: SPARK-9340
 URL: https://issues.apache.org/jira/browse/SPARK-9340
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.4.0
Reporter: Damian Guy
 Attachments: ParquetTypesConverterTest.scala




