Hi, I'm trying to run a sqoop export for a large dataset (at least 1B rows) with the following sqoop export call:
sqoop export --direct \
  --connect <host> \
  --table <table> \
  --export-dir /user/hive/warehouse/<table> \
  --num-mappers 8 \
  --username <username> \
  --password <password> \
  --input-fields-terminated-by ',' \
  --verbose

Behind the scenes, I've found that sqoop export does what you'd expect: it farms the work out to --num-mappers map tasks on nodes with a NodeManager role, stages about 32MB worth of HDFS data at a time into a temp file on each node, and hands that file to mysqlimport, which issues a LOAD DATA LOCAL INFILE for the temp file into the MySQL table.

The following error occurs depending on the level of parallelism used (via --num-mappers): 2 mappers don't trigger it, but 10 definitely will:

14/09/14 17:34:25 INFO mapreduce.Job:  map 25% reduce 0%
14/09/14 17:34:27 INFO mapreduce.Job: Task Id : attempt_1410271365435_0034_m_000000_0, Status : FAILED
Error: java.io.IOException: mysqlimport terminated with error code 1
        at org.apache.sqoop.mapreduce.MySQLExportMapper.closeExportHandles(MySQLExportMapper.java:313)
        at org.apache.sqoop.mapreduce.MySQLExportMapper.run(MySQLExportMapper.java:250)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

I understand there is some limit to the level of parallelism that can be achieved in the job -- mysqld can get tied up processing too many things at once -- but I'd like to know how to turn on debugging for the org.apache.sqoop.mapreduce.MySQLExportMapper class so that I can actually see the mysqlimport error.

Reading through the following code [0] (not sure if this is the relevant version, BTW), I see that a logger is set up that should be giving a lot of information [1] about the mysqlimport calls, but none of it is showing up in my logs.

[0] https://svn.apache.org/repos/asf/sqoop/trunk/src/java/org/apache/sqoop/mapreduce/MySQLExportMapper.java
[1] `LOG.debug("Starting mysqlimport with arguments:");`

Additional info: I have log4j.properties set up in the following basic way:

log4j.rootLogger=${root.logger}
root.logger=INFO,console
log4j.logger.org.apache.hadoop.mapred=TRACE
log4j.logger.org.apache.sqoop.mapreduce=TRACE
log4j.logger.org.apache.sqoop.mapreduce.MySQLExportMapper=TRACE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n

What I have found is that the `max_allowed_packet` setting in MySQL seems to affect this behaviour somewhat, but I'd rather get more information about the actual error than tweak a setting "blind".

Relevant versions:

Cloudera Hadoop Distribution (5.1.2)
mysqlimport: Ver 3.7 Distrib 5.5.38, for debian-linux-gnu
sqoop version: 1.4.4

Kind regards,
Christian Verkerk

--
Christian Verkerk
Software Engineer, Tubular Labs
[email protected]
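
P.S. A couple of things I'm planning to try next, in case they're relevant to an answer. My current suspicion is that the log4j.properties above only affects the client process, not the map tasks where MySQLExportMapper actually runs, so one option (not yet verified) is to raise the map-task log level through the generic Hadoop options on the command line, e.g.:

sqoop export -Dmapreduce.map.log.level=DEBUG \
  --direct \
  --connect <host> \
  ... (rest of the arguments as above)

where the -D option has to come before the sqoop-specific arguments.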
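
Similarly, I assume I can pull the full logs for the failed attempt (where the mysqlimport stderr should end up) with something like:

yarn logs -applicationId application_1410271365435_0034

assuming YARN log aggregation is enabled; the application id here is derived from the attempt id attempt_1410271365435_0034_m_000000_0 in the output above.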
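
For what it's worth, my mental model of what each mysqlimport call amounts to on the MySQL side is roughly the following (path and table name are placeholders, not actual values):

LOAD DATA LOCAL INFILE '<tempfile>.txt'
  INTO TABLE <table>
  FIELDS TERMINATED BY ',';

so the "error code 1" is presumably either the server rejecting one of these statements or mysqlimport itself failing, which is why I'd like to see its actual output.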
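
And for completeness, this is how I've been inspecting/adjusting max_allowed_packet while experimenting (the value below is just an example, not a recommendation):

mysql> SHOW VARIABLES LIKE 'max_allowed_packet';
mysql> SET GLOBAL max_allowed_packet = 256*1024*1024;

New connections pick up the changed GLOBAL value, so only mysqlimport runs started after the change would see it.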
