[ https://issues.apache.org/jira/browse/SQOOP-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266081#comment-16266081 ]
Yulei Yang commented on SQOOP-3262:
-----------------------------------

If a user wants to use this patch, the usage is:

sqoop import -D org.apache.sqoop.db.type=mysql

> Duplicate rows found when split-by column is of type String
> ------------------------------------------------------------
>
>                 Key: SQOOP-3262
>                 URL: https://issues.apache.org/jira/browse/SQOOP-3262
>             Project: Sqoop
>          Issue Type: Bug
>          Components: connectors/generic
>    Affects Versions: 1.4.6
>            Reporter: Yulei Yang
>         Attachments: sqoop_3262.patch
>
>
> When using a string (or char) type column as the split-by column, we sometimes
> find duplicate rows. Usually this is caused by the source RDBMS being case
> insensitive (CI) when doing comparisons. Here is a case (split query SQL):
> 1. where id >= 'A' and id < 'E'
> 2. where id >= 'a' and id < 'e'
> If the RDBMS is CI, these two different splits return the same rows, which
> causes the duplication.
> By default Oracle and DB2 are case sensitive (CS), but a DBA can change this,
> so we need to check it before import.
> By default SQL Server is CI; the workaround is --split-by '<your_column>
> collate xxx_collation', e.g. Chinese_PRC_Bin.
> By default InterSystems Caché is CI; the workaround is --split-by
> "%sqlstring(<your_column>)".
> MySQL is CI by default, but --split-by 'binary <your_column>' throws the
> exception below:
> ERROR tool.ImportTool: Encountered IOException running import job:
> java.io.IOException: Sqoop does not have the splitter for the given SQL data
> type. Please use either different split column (argument --split-by) or lower
> the number of mappers to 1. Unknown SQL data type: -3
>     at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:165)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>     at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>     at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:196)
>     at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:169)
>     at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:266)
>     at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:729)
>     at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
>     at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
>     at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
>     at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
>     at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
>     at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
> I have attached a patch for MySQL's case.
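To make the overlap described in the report concrete, here is a minimal MySQL sketch; the table name, the sample values, and the utf8_general_ci collation are chosen only for illustration and are not taken from the issue:

    -- Hypothetical table with a case-insensitive (CI) collation.
    CREATE TABLE t (id VARCHAR(16)) CHARACTER SET utf8 COLLATE utf8_general_ci;
    INSERT INTO t VALUES ('Alpha'), ('alpha'), ('Delta'), ('delta');

    -- Two split queries of the shape Sqoop generates for a string split-by column:
    SELECT id FROM t WHERE id >= 'A' AND id < 'E';  -- CI comparison: returns all 4 rows
    SELECT id FROM t WHERE id >= 'a' AND id < 'e';  -- CI comparison: also returns all 4 rows
    -- Both mappers import the same rows, so the target ends up with duplicates.

    -- A binary (case-sensitive) comparison keeps the two ranges disjoint:
    SELECT id FROM t WHERE BINARY id >= 'A' AND BINARY id < 'E';  -- returns 'Alpha', 'Delta' only

This binary comparison is what the --split-by 'binary <your_column>' workaround tries to force; as the stack trace above shows, Sqoop rejects the resulting VARBINARY column (unknown SQL data type -3), which is why a patch was needed for the MySQL case.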