Hi Jason,

Unfortunately that is a limitation of the mysqldump utility itself.
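For reference, once --direct is dropped, the NULL substitution strings can
be set explicitly on the import side. A rough sketch (the host, database,
and table names below are placeholders, not from your job):

sqoop import \
  --connect jdbc:mysql://db.example.com/reporting \
  --username sqoop_user \
  --table orders \
  --hive-import \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --num-mappers 4

With '\\N' as the substitution string, missing values come through as real
NULLs in Hive rather than as the literal string "NULL".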
Did reducing the number of mappers help?

Thanks,
Kate

On Tue, Mar 25, 2014 at 7:38 PM, Jason Rosenberg <[email protected]> wrote:
> Thanks Kate,
>
> Very helpful. Is there any chance in the future of having the null
> substitution handling supported in direct mode? Or is it a fundamental
> issue with hive itself?
>
> I'll try reducing the number of mappers further....
>
> Jason
>
> On Tue, Mar 25, 2014 at 8:26 PM, Kathleen Ting <[email protected]> wrote:
>> Hi Jason,
>>
>> Sorry for the delay.
>>
>> Generally speaking, you'll want to decrease num-mappers (default is 4)
>> to lessen the load on the db (but the ingest rate will decline), and
>> you'll want to increase num-mappers to improve the ingest rate (but db
>> load will be negatively impacted). Beyond that we can't give any
>> recommendation with regard to the number of mappers, as every
>> environment is different. The main bottleneck in a Sqoop job is the
>> shared database system. Sqoop can scale only to the extent allowed by
>> the particular database. Furthermore, this can differ from table to
>> table even within a single database system, depending on which disks
>> the particular import table uses.
>>
>> The MySQL direct connector uses a native utility called mysqldump to
>> perform a highly efficient data transfer between the MySQL server and
>> the Hadoop cluster. This utility unfortunately does not support custom
>> NULL substitution strings and will always import missing values as the
>> string constant NULL. This is very confusing on the Hive side, as the
>> Hive shell will display the value as NULL as well. It won't be
>> perceived as a missing value, but as a valid string constant. You need
>> to turn off direct mode (by omitting the --direct option) in order to
>> override the default NULL substitution string.
>>
>> Regards,
>> Kate
>>
>> On Tue, Mar 25, 2014 at 8:26 AM, Jason Rosenberg <[email protected]> wrote:
>>> Ping!
>>>
>>> On Mon, Mar 24, 2014 at 11:50 AM, Jason Rosenberg <[email protected]> wrote:
>>>> Hi Kathleen, et al.,
>>>>
>>>> Any thoughts on my parameters above? Would this likely work for
>>>> --direct mode?
>>>>
>>>> Thanks,
>>>>
>>>> Jason
>>>>
>>>> On Thu, Mar 20, 2014 at 3:56 PM, Jason Rosenberg <[email protected]> wrote:
>>>>> Hi Kathleen,
>>>>>
>>>>> The sqoop command looks like this (a bash function that takes a few
>>>>> args naming the source/dest table names, update key, and dest
>>>>> columns):
>>>>>
>>>>> sqoop_table () {
>>>>>   sqoop export \
>>>>>     --connect jdbc:mysql://$MYSQL_HOST/$MYSQL_DATABASE \
>>>>>     --username $MYSQL_USER \
>>>>>     --outdir $LOCAL_TEMP \
>>>>>     --export-dir /user/hive/warehouse/$HIVE_DATABASE.db/$1 \
>>>>>     --table $2 \
>>>>>     --update-key $3 \
>>>>>     --update-mode allowinsert \
>>>>>     --columns $4 \
>>>>>     --input-fields-terminated-by "\t" \
>>>>>     --input-null-string '\\N' \
>>>>>     --input-null-non-string '\\N' \
>>>>>     --batch \
>>>>>     --num-mappers 8
>>>>> }
>>>>>
>>>>> The lock contention exceptions seem to be retried, up to a point. If
>>>>> they happen too often, the hadoop job tracker eventually decides to
>>>>> kill the job. Even when the job succeeds, there are usually a few of
>>>>> these exceptions. I originally had many more mappers configured, but
>>>>> here I've reduced it to 8, and this seems to help (although it's not
>>>>> a guarantee). Reducing the num-mappers also makes it less likely
>>>>> that the target mysql db machine will get overloaded (e.g. it was
>>>>> getting maxed out on cpu usage with 64 mappers, etc.).
>>>>>
>>>>> The number of records being sqooped is on the order of 1-5M at a
>>>>> time.
>>>>>
>>>>> Sounds like I should try --direct mode (but I'm not sure if the
>>>>> null/delimiters we're using will work with it?).
>>>>>
>>>>> Jason
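A usage sketch for the function quoted above, with made-up table, key, and
column names rather than the actual schema ($1 is the Hive table directory,
$2 the MySQL table, $3 the update key, $4 the destination column list):

sqoop_table page_views page_views "event_date,page_id" "event_date,page_id,view_count"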
>>>>> On Thu, Mar 20, 2014 at 12:46 PM, Kathleen Ting <[email protected]> wrote:
>>>>>> Hi Jason,
>>>>>>
>>>>>> Rather than using the JDBC interface for transferring data, direct
>>>>>> mode delegates the job of transferring data to the native utilities
>>>>>> provided by the database vendor. In the case of MySQL, mysqldump
>>>>>> and mysqlimport will be used for retrieving data from the database
>>>>>> server or moving data back. Using native utilities will greatly
>>>>>> improve performance, as they are optimized to provide the best
>>>>>> possible transfer speed while putting less burden on the database
>>>>>> server. That said, there are several limitations that come with
>>>>>> this faster import. In the case of MySQL, each node hosting a
>>>>>> TaskTracker service needs to have both the mysqldump and
>>>>>> mysqlimport utilities installed. Another limitation of direct mode
>>>>>> is that not all parameters are supported. As the native utilities
>>>>>> usually produce text output, binary formats like SequenceFile or
>>>>>> Avro won't work. Also, parameters that customize the escape
>>>>>> characters, type mapping, column and row delimiters, or the NULL
>>>>>> substitution string might not be supported in all cases.
>>>>>>
>>>>>> Can you share your entire Sqoop command and the contents of failed
>>>>>> task attempt attempt_201403180842_0202_m_000002_1?
>>>>>>
>>>>>> Thanks,
>>>>>> Kate
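A minimal --direct import sketch under the assumptions above (mysqldump
installed on every TaskTracker node; connection details are placeholders):

sqoop import \
  --connect jdbc:mysql://db.example.com/reporting \
  --username sqoop_user \
  --table orders \
  --direct \
  --num-mappers 4

Text output is implied here; combining --direct with --as-sequencefile or
--as-avrodatafile runs into the format limitation described above.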
>>>>>> On Thu, Mar 20, 2014 at 8:24 AM, Jason Rosenberg <[email protected]> wrote:
>>>>>>> Thoughts anyone?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>> On Tue, Mar 18, 2014 at 2:23 PM, Jason Rosenberg <[email protected]> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm wondering if there are expected performance increases from
>>>>>>>> using the --direct flag for exporting from hive to mysql. If so,
>>>>>>>> how much speedup?
>>>>>>>>
>>>>>>>> Also, I've been getting lock contention errors during export, and
>>>>>>>> I'm wondering if these are less likely using --direct mode? E.g.
>>>>>>>> I'm getting these sorts of exceptions on the sqoop console:
>>>>>>>>
>>>>>>>> 14/03/18 14:44:15 INFO mapred.JobClient: Task Id : attempt_201403180842_0202_m_000002_1, Status : FAILED
>>>>>>>> java.io.IOException: Can't export data, please check failed map task logs
>>>>>>>>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
>>>>>>>>         at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
>>>>>>>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
>>>>>>>>         at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
>>>>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
>>>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>>>>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>>>>>>> Caused by: java.io.IOException: java.sql.BatchUpdateException: Deadlock found when trying to get lock; try restarting transaction
>>>>>>>>         at org.apache.sqoop.mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWr
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jason
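When one of these deadlocks is reported, the MySQL server itself usually
has the detail showing which statements collided. A quick diagnostic
sketch, run directly against the database rather than through Sqoop (it
reuses the $MYSQL_* variables from the export script above):

mysql -h $MYSQL_HOST -u $MYSQL_USER -p $MYSQL_DATABASE \
  -e 'SHOW ENGINE INNODB STATUS\G'

The LATEST DETECTED DEADLOCK section of that output lists the two
transactions and the locks they were waiting on; with --update-mode
allowinsert and several mappers writing the same table concurrently, those
transactions are likely to be the parallel export tasks.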
