Hi Kathleen, et al.,

Any thoughts on my parameters above? Is this likely to work in --direct mode?
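In case it's useful, here is the direct-mode variant I have in mind: just my existing export function (quoted below) with --direct added. This is only a sketch; I'm not sure every flag survives the switch to the native mysqlimport path (the delimiter and null settings in particular are what I'm asking about), and I've left out the update-key flags since I doubt upserts are supported there:

sqoop_table_direct () {
  # Sketch only: same as my JDBC-based version, plus --direct.
  # Per Kathleen's note, this needs mysqlimport installed on every
  # node running a TaskTracker.
  sqoop export \
    --direct \
    --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
    --username $MYSQL_USER \
    --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
    --table $2 \
    --columns $4 \
    --input-fields-terminated-by "\t" \
    --input-null-string '\\N' \
    --input-null-non-string '\\N' \
    --num-mappers 8
}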
Thanks,

Jason

On Thu, Mar 20, 2014 at 3:56 PM, Jason Rosenberg <[email protected]> wrote:

> Hi Kathleen,
>
> The sqoop command looks like this (a bash function that takes a few args
> naming the source/dest table names, the update key, and the dest columns):
>
> sqoop_table () {
>   sqoop export \
>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
>     --username $MYSQL_USER \
>     --outdir $LOCAL_TEMP \
>     --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
>     --table $2 \
>     --update-key $3 \
>     --update-mode allowinsert \
>     --columns $4 \
>     --input-fields-terminated-by "\t" \
>     --input-null-string '\\N' \
>     --input-null-non-string '\\N' \
>     --batch \
>     --num-mappers 8
> }
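>
> For example, a typical invocation looks like this (the table, key, and
> column names here are just placeholders):
>
>   sqoop_table visits visits_by_day visit_id "visit_id,user_id,visit_time"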
>
> The lock contention exceptions seem to be retried, up to a point. If they
> happen too often, the hadoop job tracker eventually decides to kill the
> job. Even when the job succeeds, there are usually a few of these
> exceptions. I originally had many more mappers configured, but here I've
> reduced it to 8, and this seems to help (although it's not a guarantee).
> Reducing num-mappers also makes it less likely that the target mysql db
> machine will get overloaded (e.g. it was getting maxed out on cpu usage
> with 64 mappers).
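>
> One thing I haven't tried yet is shrinking each transaction so that locks
> are held for less time. If I'm reading the Sqoop docs right, these two
> properties control the statement/transaction batching (the numbers below
> are untested guesses, and the -D args have to come right after "export"):
>
>   sqoop export \
>     -Dsqoop.export.records.per.statement=100 \
>     -Dsqoop.export.statements.per.transaction=1 \
>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
>     ...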
>
> The number of records being sqooped is on the order of 1-5M at a time.
>
> Sounds like I should try --direct mode (though I'm not sure whether the
> null substitutions and delimiters we're using will work with it?).
>
> Jason
>
> On Thu, Mar 20, 2014 at 12:46 PM, Kathleen Ting <[email protected]> wrote:
>
>> Hi Jason,
>>
>> Rather than using the JDBC interface for transferring data, direct
>> mode delegates the job of transferring data to the native utilities
>> provided by the database vendor. In the case of MySQL, mysqldump
>> and mysqlimport are used for retrieving data from the database
>> server and for moving data back. Using the native utilities greatly
>> improves performance, as they are optimized to provide the best
>> possible transfer speed while putting less of a burden on the database
>> server. That said, there are several limitations that come with this
>> faster transfer. In the case of MySQL, each node hosting a TaskTracker
>> service needs to have both the mysqldump and mysqlimport utilities
>> installed. Another limitation of direct mode is that not all
>> parameters are supported. As the native utilities usually produce text
>> output, binary formats like SequenceFile or Avro won't work. Also,
>> parameters that customize the escape characters, type mapping, column
>> and row delimiters, or the NULL substitution string might not be
>> supported in all cases.
>>
>> Can you share your entire Sqoop command and the contents of failed
>> task attempt attempt_201403180842_0202_m_000002_1?
>>
>> Thanks,
>> Kate
>>
>> On Thu, Mar 20, 2014 at 8:24 AM, Jason Rosenberg <[email protected]> wrote:
>>
>>> Thoughts anyone?
>>>
>>> Thanks,
>>>
>>> Jason
>>>
>>> On Tue, Mar 18, 2014 at 2:23 PM, Jason Rosenberg <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm wondering if there are expected performance gains from using the
>>>> --direct flag for exporting from hive to mysql. If so, how much speedup?
>>>>
>>>> Also, I've been getting lock contention errors during export, and I'm
>>>> wondering if these are less likely in --direct mode? E.g., I'm getting
>>>> these sorts of exceptions on the sqoop console:
>>>>
>>>> 14/03/18 14:44:15 INFO mapred.JobClient: Task Id : attempt_201403180842_0202_m_000002_1, Status : FAILED
>>>> java.io.IOException: Can't export data, please check failed map task logs
>>>>     at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
>>>>     at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
>>>>     at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>     at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>> Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock
>>>> found when trying to get lock; try restarting transaction
>>>>     at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr
>>>>
>>>> Thanks,
>>>>
>>>> Jason