Sqoop import: Issue in extracting primary key for 'split-by'

Sachneet Singh Bains Wed, 01 Apr 2015 01:30:35 -0700

Hi,

We are using Sqoop (1.4.5)to import tables from Teradata to Hadoop.
Came across an issue that was breaking the import every time. We debugged and 
resolved it.
Sharing as it may be of some help.


The Scenario and failure:
(I have renamed table and db for explanation)

*         Importing a table named TABLE1 (Primary Key - PK1) from database DB1

*         Not specifying the split by field in the import command.(Sqoop in 
this case fetches the primary key and uses it as the split-by column)

*         The import fails every time saying PK2 does not exist in TABLE1 with 
the following error
ERROR tool.ImportTool: Encountered IOException running import job: 
java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata 
Database] [TeraJDBC 14.10.00.26] [Error 5628] [SQLState HY000] Column PK2 not 
found in DB1.TABLE1.
        at 
org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:170)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
        at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)

The Issue and solution:
Issue: The  issue is that in the same Teradata DBMS there exists a table by the 
same name TABLE1 (in different database DB2) with a different primary key -PK2.
Sqoop is using the primary key of this table instead of the correct one. Below 
code is the problem:
org.apache.sqoop.manager.SqlManager.java (line number 559)
getPrimaryKeys(null, null, tableName) returns a result set of pk's for all the 
tables in the DBMS and the code uses the first result without confirming the 
database it belongs to.
In our case the first result is from DB2 though DB1.TABLE1 was expected.
public String getPrimaryKey(String tableName) {
    try {
      DatabaseMetaData metaData = this.getConnection().getMetaData();
      ResultSet results = metaData.getPrimaryKeys(null, null, tableName);
      if (null == results) {
        return null;
      }

      try {
        if (results.next()) {
          return results.getString("COLUMN_NAME");
        } else {
          return null;
        }
Solution:

1.       If we use 'Split-by' explicitly in the import command this code won't 
be executed and thus no issues.

2.       Using single mapper also avoids the issue as no splits are required in 
that case.

3.       Modify the code above to use the database name as well while fetching 
primary key. Parse the database name from the URL and use it while getting the 
primary key:

ResultSet results = metaData.getPrimaryKeys(null, dbName, tableName);

Thanks,
Sachneet









________________________________






NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.

Sqoop import: Issue in extracting primary key for 'split-by'

Reply via email to