Hi Jason,

Unfortunately that is a limitation of the mysqldump utility itself.
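For reference, once --direct is dropped, the NULL substitution strings can
be set explicitly on the import side. A rough sketch (the host, database,
and table names below are placeholders, not from your job):

sqoop import \
  --connect jdbc:mysql://db.example.com/reporting \
  --username sqoop_user \
  --table orders \
  --hive-import \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --num-mappers 4

With '\\N' as the substitution string, missing values come through as real
NULLs in Hive rather than as the literal string "NULL".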
Did reducing the number of mappers help?

Thanks,
Kate

On Tue, Mar 25, 2014 at 7:38 PM, Jason Rosenberg <[email protected]> wrote:
> Thanks Kate,
>
> Very helpful. Is there any chance in the future of having the null
> substitution handling supported in direct mode? Or is it a fundamental
> issue with hive itself?
>
> I'll try reducing the number of mappers further....
>
> Jason
>
> On Tue, Mar 25, 2014 at 8:26 PM, Kathleen Ting <[email protected]> wrote:
>> Hi Jason,
>>
>> Sorry for the delay.
>>
>> Generally speaking, you'll want to decrease num-mappers (default is 4)
>> to lessen the load on the db (but the ingest rate will decline), and
>> you'll want to increase num-mappers to improve the ingest rate (but db
>> load will be negatively impacted). Beyond that we can't give any
>> recommendation with regard to the number of mappers, as every
>> environment is different. The main bottleneck in a Sqoop job is the
>> shared database system. Sqoop can scale only to the extent allowed by
>> the particular database. Furthermore, this can differ from table to
>> table even within a single database system, depending on which disks
>> the particular import table uses.
>>
>> The MySQL direct connector uses a native utility called mysqldump to
>> perform a highly efficient data transfer between the MySQL server and
>> the Hadoop cluster. This utility unfortunately does not support custom
>> NULL substitution strings and will always import missing values as the
>> string constant NULL. This is very confusing on the Hive side, as the
>> Hive shell will display the value as NULL as well. It won't be
>> perceived as a missing value, but as a valid string constant. You need
>> to turn off direct mode (by omitting the --direct option) in order to
>> override the default NULL substitution string.
>>
>> Regards,
>> Kate
>>
>> On Tue, Mar 25, 2014 at 8:26 AM, Jason Rosenberg <[email protected]> wrote:
>>> Ping!
>>>
>>> On Mon, Mar 24, 2014 at 11:50 AM, Jason Rosenberg <[email protected]> wrote:
>>>> Hi Kathleen, et al.,
>>>>
>>>> Any thoughts on my parameters above? Would this likely work for
>>>> --direct mode?
>>>>
>>>> Thanks,
>>>>
>>>> Jason
>>>>
>>>> On Thu, Mar 20, 2014 at 3:56 PM, Jason Rosenberg <[email protected]> wrote:
>>>>> Hi Kathleen,
>>>>>
>>>>> The sqoop command looks like this (a bash function that takes a few
>>>>> args naming the source/dest table names, update key, and dest
>>>>> columns):
>>>>>
>>>>> sqoop_table () {
>>>>>   sqoop export \
>>>>>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
>>>>>     --username $MYSQL_USER \
>>>>>     --outdir $LOCAL_TEMP \
>>>>>     --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
>>>>>     --table $2 \
>>>>>     --update-key $3 \
>>>>>     --update-mode allowinsert \
>>>>>     --columns $4 \
>>>>>     --input-fields-terminated-by "\t" \
>>>>>     --input-null-string '\\N' \
>>>>>     --input-null-non-string '\\N' \
>>>>>     --batch \
>>>>>     --num-mappers 8
>>>>> }
>>>>>
>>>>> The lock contention exceptions seem to be retried, up to a point. If
>>>>> they happen too often, the hadoop job tracker eventually decides to
>>>>> kill the job. Even when the job succeeds, there are usually a few of
>>>>> these exceptions. I originally had many more mappers configured, but
>>>>> here I've reduced it to 8, and this seems to help (although it's not
>>>>> a guarantee). Reducing the num-mappers also makes it less likely
>>>>> that the target mysql db machine will get overloaded (e.g. it was
>>>>> getting maxed out on cpu usage with 64 mappers, etc.).
>>>>>
>>>>> The number of records being sqooped is on the order of 1-5M at a
>>>>> time.
>>>>>
>>>>> Sounds like I should try --direct mode (but I'm not sure if the
>>>>> null/delimiters we're using will work with it?).
>>>>>
>>>>> Jason
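A usage sketch for the function quoted above, with made-up table, key, and
column names rather than the actual schema ($1 is the Hive table directory,
$2 the MySQL table, $3 the update key, $4 the destination column list):

sqoop_table page_views page_views "event_date,page_id" "event_date,page_id,view_count"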
>>>>> On Thu, Mar 20, 2014 at 12:46 PM, Kathleen Ting <[email protected]> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>> Rather than using the JDBC interface for transferring data, direct
>>>>>> mode delegates the job of transferring data to the native utilities
>>>>>> provided by the database vendor. In the case of MySQL, mysqldump
>>>>>> and mysqlimport will be used for retrieving data from the database
>>>>>> server or moving data back. Using native utilities will greatly
>>>>>> improve performance, as they are optimized to provide the best
>>>>>> possible transfer speed while putting less burden on the database
>>>>>> server. That said, there are several limitations that come with
>>>>>> this faster import. In the case of MySQL, each node hosting a
>>>>>> TaskTracker service needs to have both the mysqldump and
>>>>>> mysqlimport utilities installed. Another limitation of direct mode
>>>>>> is that not all parameters are supported. As the native utilities
>>>>>> usually produce text output, binary formats like SequenceFile or
>>>>>> Avro won't work. Also, parameters that customize the escape
>>>>>> characters, type mapping, column and row delimiters, or the NULL
>>>>>> substitution string might not be supported in all cases.
>>>>>>
>>>>>> Can you share your entire Sqoop command and the contents of failed
>>>>>> task attempt attempt_201403180842_0202_m_000002_1?
>>>>>>
>>>>>> Thanks,
>>>>>> Kate
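A minimal --direct import sketch under the assumptions above (mysqldump
installed on every TaskTracker node; connection details are placeholders):

sqoop import \
  --connect jdbc:mysql://db.example.com/reporting \
  --username sqoop_user \
  --table orders \
  --direct \
  --num-mappers 4

Text output is implied here; combining --direct with --as-sequencefile or
--as-avrodatafile runs into the format limitation described above.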
>>>>>> On Thu, Mar 20, 2014 at 8:24 AM, Jason Rosenberg <[email protected]> wrote:
>>>>>>> Thoughts anyone?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>> On Tue, Mar 18, 2014 at 2:23 PM, Jason Rosenberg <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm wondering if there are expected performance increases from
>>>>>>>> using the --direct flag for exporting from hive to mysql. If so,
>>>>>>>> how much speedup?
>>>>>>>>
>>>>>>>> Also, I've been getting lock contention errors during export, and
>>>>>>>> I'm wondering if these are less likely using --direct mode? E.g.
>>>>>>>> I'm getting these sorts of exceptions on the sqoop console:
>>>>>>>>
>>>>>>>> 14/03/18 14:44:15 INFO mapred.JobClient: Task Id : attempt_201403180842_0202_m_000002_1, Status : FAILED
>>>>>>>> java.io.IOException: Can't export data, please check failed map task logs
>>>>>>>>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
>>>>>>>>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
>>>>>>>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
>>>>>>>>         at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
>>>>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>>>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>>>>>> Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction
>>>>>>>>         at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jason
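When one of these deadlocks is reported, the MySQL server itself usually
has the detail showing which statements collided. A quick diagnostic
sketch, run directly against the database rather than through Sqoop (it
reuses the $MYSQL_* variables from the export script above):

mysql -h $MYSQL_HOST -u $MYSQL_USER -p $MYSQL_DATABASE \
  -e 'SHOW ENGINE INNODB STATUS\G'

The LATEST DETECTED DEADLOCK section of that output lists the two
transactions and the locks they were waiting on; with --update-mode
allowinsert and several mappers writing the same table concurrently, those
transactions are likely to be the parallel export tasks.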
