Jason Moore created SPARK-17195:
-----------------------------------

             Summary: Dealing with JDBC column nullability when it is not 
reliable
                 Key: SPARK-17195
                 URL: https://issues.apache.org/jira/browse/SPARK-17195
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Jason Moore


Starting with Spark 2.0.0, a column's "nullable" property must be correct for 
code generation to work properly.  Before 2.0.0, marking a column as 
nullable = false still allowed null values to be operated on, but now this 
results in:

{noformat}
Caused by: java.lang.NullPointerException
        at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
{noformat}

I'm all for the change towards more rigid behavior (enforcing correct 
input).  But the problem I'm facing now is that when I use JDBC to read from a 
Teradata server, the column nullability is often not correct (particularly when 
sub-queries are involved).

This is the line in question:
https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140

I'm trying to work out the best way forward on this.  I know that it's really 
the fault of the Teradata database server for not returning the correct schema, 
but I'll need to make either Spark itself or my application resilient to this 
behavior.

One of the Teradata JDBC Driver tech leads has told me that "when the 
rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length 
string, then the other metadata values may not be completely accurate" - so one 
option could be to treat the nullability (at least) the same way as the 
"unknown" case (as nullable = true).  For reference, see the rest of our 
discussion here: 
http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
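As a sketch of that option, the check could look like the following. This is plain Scala and purely illustrative; the object and method names (`NullabilityHeuristic`, `effectiveNullable`) are hypothetical, not Spark's actual JDBCRDD code:

```scala
// Illustrative sketch only: models the proposed heuristic, not actual
// Spark or Teradata driver code. Per the driver tech lead's advice, when
// rsmd.getSchemaName and rsmd.getTableName both return empty strings,
// the other ResultSetMetaData values may not be accurate, so fall back
// to the "unknown" case: nullable = true.
object NullabilityHeuristic {
  def effectiveNullable(schemaName: String,
                        tableName: String,
                        reportedNullable: Boolean): Boolean = {
    val metadataUntrustworthy = schemaName.isEmpty && tableName.isEmpty
    if (metadataUntrustworthy) true else reportedNullable
  }
}
```

Under this heuristic, a column the driver reports as nullable = false would still be treated as nullable whenever the surrounding metadata is untrustworthy, which avoids the NullPointerException at the cost of losing a potential optimization.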

Any other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
