Nannan Yu created SPARK-22706:
---------------------------------

             Summary: Cannot read Teradata CLOB column type correctly in Spark 2.2.0
                 Key: SPARK-22706
                 URL: https://issues.apache.org/jira/browse/SPARK-22706
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
         Environment: Spark 2.2.0 
Teradata driver version: 16.20.00.02 (terajdbc4.jar)
Tested on macOS
            Reporter: Nannan Yu


When I try to read a table from a Teradata database with a URL like
"jdbc:teradata://IP/Database", it works well for most column datatypes.

But for the CLOB datatype, Spark cannot read the actual content.
Instead, it returns results like this:
   ------------------------------------------------
    DT_INT | DT_CLOB
   ------------------------------------------------
      1    | com.teradata.jdbc.jdk6_SQL_Clob@3cc9ee67
      2    | com.teradata.jdbc.jdk6_SQL_Clob@7cf9ecad
   ------------------------------------------------

It should instead show the actual Character Large Object content:

   ------------------------------------------------
    DT_INT | DT_CLOB
   ------------------------------------------------
      1    | JESSIE
      2    | CATHY
   ------------------------------------------------
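
A minimal way to reproduce (a sketch only; the TEST_CLOB table name, user,
and password are placeholders, not taken from the real environment):

    // Hypothetical reproduction: read a Teradata table with a CLOB column
    // through Spark's JDBC source.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:teradata://IP/Database")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "TEST_CLOB")     // placeholder table
      .option("user", "user")             // placeholder credentials
      .option("password", "password")
      .load()

    df.show()
    // DT_CLOB displays as "com.teradata.jdbc.jdk6_SQL_Clob@..." instead of
    // the actual text.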

Note: this issue does not affect every 'CLOB' column datatype; for example,
the DB2 CLOB datatype works fine with Spark.
For Teradata, the issue may be related to how CLOBs are stored, as described
at https://www.dwhpro.com/teradata-clob/:
   "The data of a CLOB column is stored as a separate sub-table."

From the Spark source code in
org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:

    case StringType =>
      (rs: ResultSet, row: InternalRow, pos: Int) =>
        // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
        row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))

It seems this approach does not work for CLOB columns returned by the
Teradata driver.

I tried a different way to read the CLOB data from Teradata, and it works
well. It also still works for the DB2 CLOB data.

    case StringType =>
      (rs: ResultSet, row: InternalRow, pos: Int) => {
        val sqlType = rs.getMetaData.getColumnType(pos + 1)
        if (java.sql.Types.CLOB == sqlType) {
          // Read the CLOB explicitly and materialize its full contents
          val clobData = rs.getClob(pos + 1)
          if (clobData != null) {
            row.update(pos, UTF8String.fromString(
              clobData.getSubString(1, clobData.length().toInt)))
          } else {
            row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
          }
        } else {
          // TODO(davies): use getBytes for better performance, if the encoding is UTF-8
          row.update(pos, UTF8String.fromString(rs.getString(pos + 1)))
        }
      }
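
For reference, the same getClob/getSubString approach can be verified
directly against the JDBC driver, outside Spark (a sketch; the connection
details and TEST_CLOB table name are placeholders):

    import java.sql.DriverManager

    // Standalone check that getClob + getSubString returns the real text
    // from the Teradata driver. Credentials and table are placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:teradata://IP/Database", "user", "password")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT DT_INT, DT_CLOB FROM TEST_CLOB")
      while (rs.next()) {
        val clob = rs.getClob(2)
        val text =
          if (clob != null) clob.getSubString(1, clob.length().toInt) else null
        println(s"${rs.getInt(1)} | $text")  // expected: 1 | JESSIE, 2 | CATHY
      }
    } finally {
      conn.close()
    }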

Should we handle the CLOB datatype differently from the normal 'StringType'?


