Thanks Kate, very helpful. Is there any chance that the null substitution handling will be supported in direct mode in the future? Or is it a fundamental issue with Hive itself?
I'll try reducing the number of mappers further....

Jason

On Tue, Mar 25, 2014 at 8:26 PM, Kathleen Ting <[email protected]> wrote:
> Hi Jason,
>
> Sorry for the delay.
>
> Generally speaking, you'll want to decrease num-mappers (default is 4)
> to lessen the load on the db (but the ingest rate will decline), and you'll
> want to increase num-mappers to improve the ingest rate (but db load
> will be negatively impacted). Beyond that we can't give any
> recommendation with regard to the number of mappers, as every
> environment is different. The main bottleneck in a Sqoop job is the
> shared database system. Sqoop can scale only to the extent that is
> allowed by the particular database. Furthermore, this can differ from
> table to table even within a single database system, depending on what
> disks the particular import table uses.
>
> The MySQL direct connector uses a native utility called mysqldump to
> perform a highly efficient data transfer between the MySQL server and
> the Hadoop cluster. This utility unfortunately does not support custom
> NULL substitution strings and will always import missing values as the
> string constant NULL. This is very confusing on the Hive side, as the
> Hive shell will display the value as NULL as well. It won't be
> perceived as a missing value, but as a valid string constant. You need
> to turn off direct mode (by omitting the --direct option) in order to
> override the default NULL substitution string.
>
> Regards,
> Kate
>
> On Tue, Mar 25, 2014 at 8:26 AM, Jason Rosenberg <[email protected]> wrote:
> > Ping!
> >
> >
> > On Mon, Mar 24, 2014 at 11:50 AM, Jason Rosenberg <[email protected]> wrote:
> >>
> >> Hi Kathleen, et al,
> >>
> >> Any thoughts on my parameters above? Would this likely work for
> >> --direct mode?
> >>
> >> Thanks,
> >>
> >> Jason
> >>
> >>
> >> On Thu, Mar 20, 2014 at 3:56 PM, Jason Rosenberg <[email protected]> wrote:
> >>>
> >>> Hi Kathleen,
> >>>
> >>> The sqoop command looks like this (a bash function that takes a few
> >>> args naming the source/dest table names, update key, and dest columns):
> >>>
> >>>   sqoop_table () {
> >>>     sqoop export \
> >>>       --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
> >>>       --username $MYSQL_USER \
> >>>       --outdir $LOCAL_TEMP \
> >>>       --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
> >>>       --table $2 \
> >>>       --update-key $3 \
> >>>       --update-mode allowinsert \
> >>>       --columns $4 \
> >>>       --input-fields-terminated-by "\t" \
> >>>       --input-null-string '\\N' \
> >>>       --input-null-non-string '\\N' \
> >>>       --batch \
> >>>       --num-mappers 8
> >>>   }
> >>>
> >>> The lock contention exceptions seem to be retried, up to a point. If
> >>> they happen too often, the Hadoop job tracker eventually decides to
> >>> kill the job. Even when the job succeeds, there are usually a few of
> >>> these exceptions. I originally had many more mappers configured, but
> >>> here I've reduced it to 8, and this seems to help (although it's not a
> >>> guarantee). Reducing num-mappers also makes it less likely that the
> >>> target MySQL db machine will get overloaded (e.g. it was getting maxed
> >>> out on CPU usage with 64 mappers).
> >>>
> >>> The number of records being sqooped is on the order of 1-5M at a time.
> >>>
> >>> Sounds like I should try --direct mode (but I'm not sure if the
> >>> null/delimiters we're using will work with it?).
> >>>
> >>> Jason
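A hypothetical invocation of the sqoop_table function quoted above could look like the following; the Hive table, MySQL table, update key, and column list are made-up example names, not values from this thread:

    # export Hive table page_views into MySQL table page_views_daily,
    # updating or inserting rows keyed on the id column
    sqoop_table page_views page_views_daily id "id,view_count,updated_at"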
> >>>
> >>> On Thu, Mar 20, 2014 at 12:46 PM, Kathleen Ting <[email protected]> wrote:
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>> Rather than using the JDBC interface for transferring data, the direct
> >>>> mode delegates the job of transferring data to the native utilities
> >>>> provided by the database vendor. In the case of MySQL, the mysqldump
> >>>> and mysqlimport utilities will be used for retrieving data from the
> >>>> database server or moving data back. Using native utilities will
> >>>> greatly improve performance, as they are optimized to provide the best
> >>>> possible transfer speed while putting less burden on the database
> >>>> server. That said, there are several limitations that come with this
> >>>> faster transfer. In the case of MySQL, each node hosting a TaskTracker
> >>>> service needs to have both the mysqldump and mysqlimport utilities
> >>>> installed. Another limitation of direct mode is that not all
> >>>> parameters are supported. As the native utilities usually produce text
> >>>> output, binary formats like SequenceFile or Avro won't work. Also,
> >>>> parameters that customize the escape characters, type mapping, column
> >>>> and row delimiters, or the NULL substitution string might not be
> >>>> supported in all cases.
> >>>>
> >>>> Can you share your entire Sqoop command and the contents of failed
> >>>> task attempt attempt_201403180842_0202_m_000002_1?
> >>>>
> >>>> Thanks,
> >>>> Kate
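As a rough sketch of the prerequisite Kate mentions, the presence of the native MySQL utilities can be checked on each node that runs a TaskTracker before attempting --direct; the host name below is hypothetical:

    # both utilities must resolve on every TaskTracker node for direct mode
    ssh worker01.example.com 'command -v mysqldump && command -v mysqlimport'

If either command is missing, direct-mode map tasks scheduled on that node would fail.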
> >>>>
> >>>> On Thu, Mar 20, 2014 at 8:24 AM, Jason Rosenberg <[email protected]> wrote:
> >>>> > Thoughts anyone?
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > Jason
> >>>> >
> >>>> >
> >>>> > On Tue, Mar 18, 2014 at 2:23 PM, Jason Rosenberg <[email protected]> wrote:
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> I'm wondering if there are expected performance increases when using
> >>>> >> the --direct flag for exporting from Hive to MySQL. If so, how much
> >>>> >> speedup?
> >>>> >>
> >>>> >> Also, I've been getting lock contention errors during export, and I'm
> >>>> >> wondering if these are less likely using --direct mode? E.g. I'm
> >>>> >> getting these sorts of exceptions on the sqoop console:
> >>>> >>
> >>>> >> 14/03/18 14:44:15 INFO mapred.JobClient: Task Id :
> >>>> >> attempt_201403180842_0202_m_000002_1, Status : FAILED
> >>>> >> java.io.IOException: Can't export data, please check failed map task logs
> >>>> >>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
> >>>> >>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
> >>>> >>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
> >>>> >>         at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
> >>>> >>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
> >>>> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
> >>>> >>         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> >>>> >>         at java.security.AccessController.doPrivileged(Native Method)
> >>>> >>         at javax.security.auth.Subject.doAs(Subject.java:415)
> >>>> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> >>>> >>         at org.apache.hadoop.mapred.Child.main(Child.java:262)
> >>>> >> Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock
> >>>> >> found when trying to get lock; try restarting transaction
> >>>> >>         at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr
> >>>> >>
> >>>> >> Thanks,
> >>>> >>
> >>>> >> Jason
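To make the string-constant confusion Kate describes concrete for the import direction, a quick check from the shell can separate genuinely missing values from the literal string that mysqldump writes in direct mode; the table name (raw_events) and string column (user_id) here are hypothetical:

    # rows Hive actually treats as missing
    hive -e "SELECT COUNT(*) FROM raw_events WHERE user_id IS NULL"
    # rows holding the string constant 'NULL' produced by a direct-mode import
    hive -e "SELECT COUNT(*) FROM raw_events WHERE user_id = 'NULL'"

Both kinds of row display as NULL in the Hive shell, which is exactly the confusion described above; only the first query counts values Hive treats as missing.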
