[ https://issues.apache.org/jira/browse/SPARK-17195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432314#comment-15432314 ]
Jason Moore commented on SPARK-17195:
-------------------------------------

That's right. The JDBC API has ResultSetMetaData.isNullable returning:

* ResultSetMetaData.columnNoNulls (= 0), meaning the column does not allow NULL values
* ResultSetMetaData.columnNullable (= 1), meaning the column allows NULL values
* ResultSetMetaData.columnNullableUnknown (= 2), meaning the nullability of the column's values is unknown

In Spark we take this result and do as you've described: anything not reported as non-null is treated as nullable. See the first link in the ticket description above.

> Dealing with JDBC column nullability when it is not reliable
> -------------------------------------------------------------
>
>                 Key: SPARK-17195
>                 URL: https://issues.apache.org/jira/browse/SPARK-17195
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jason Moore
>
> Starting with Spark 2.0.0, the column "nullable" property must be correct
> for code generation to work properly. Marking a column as nullable = false
> used to (< 2.0.0) still allow null values to be operated on, but now this
> results in:
> {noformat}
> Caused by: java.lang.NullPointerException
> 	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> {noformat}
> I'm all for the change towards more rigid behavior (enforcing correct
> input). But the problem I'm facing now is that when I use JDBC to read from
> a Teradata server, the column nullability is often not correct (particularly
> when sub-queries are involved).
> This is the line in question:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
> I'm trying to work out the way forward for me on this. I know it's really
> the fault of the Teradata database server for not returning the correct
> schema, but I'll need to make Spark itself or my application resilient to
> this behavior.
> One of the Teradata JDBC Driver tech leads has told me that "when the
> rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length
> string, then the other metadata values may not be completely accurate" - so
> one option could be to treat the nullability (at least) the same way as the
> "unknown" case (as nullable = true). For reference, see the rest of our
> discussion here:
> http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
> Any other thoughts?
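
For illustration, a minimal sketch of the nullability mapping described in the comment above, using plain JDBC from Scala (the connection URL, credentials, and table name are hypothetical placeholders; only the isNullable comparison reflects what Spark actually does at the JDBCRDD.scala line linked in the ticket):

{noformat}
import java.sql.{DriverManager, ResultSetMetaData}

// Hypothetical Teradata connection details, for illustration only.
val conn = DriverManager.getConnection(
  "jdbc:teradata://host/DATABASE=mydb", "user", "pass")
val rs = conn.createStatement().executeQuery("SELECT * FROM some_table")
val rsmd = rs.getMetaData

for (i <- 1 to rsmd.getColumnCount) {
  // A column is treated as nullable unless the driver explicitly reports
  // columnNoNulls; both columnNullable and columnNullableUnknown map to true.
  val nullable = rsmd.isNullable(i) != ResultSetMetaData.columnNoNulls
  println(s"${rsmd.getColumnName(i)}: nullable = $nullable")
}
{noformat}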
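And one possible application-side workaround, sketched under the assumption that forcing every column to nullable = true is acceptable; the round-trip through df.rdd below is my own illustration, not an API pattern Spark prescribes for this, and the URL and dbtable are placeholders:

{noformat}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("nullable-workaround").getOrCreate()

// Hypothetical connection properties; a sub-query is used as the dbtable
// since that is where Teradata's reported nullability tends to be wrong.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:teradata://host/DATABASE=mydb")
  .option("user", "user")
  .option("password", "pass")
  .option("dbtable", "(SELECT col1, col2 FROM some_table) t")
  .load()

// Force every field to nullable = true, mirroring how the "unknown" case
// is treated, then rebuild the DataFrame with the relaxed schema.
val relaxedSchema = StructType(df.schema.map(_.copy(nullable = true)))
val safeDf = spark.createDataFrame(df.rdd, relaxedSchema)
{noformat}

Rebuilding the DataFrame this way gives up any non-null optimizations, but it sidesteps the generated UnsafeRowWriter's non-null assumption when the driver's metadata can't be trusted.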