Sqoop split-by column limiting map tasks

Erik Knoll Thu, 30 Aug 2012 09:13:49 -0700

I'm using Sqoop 1.4.1 to import a table from MySQL to HDFS. The table
contains log entries by users, who are identified by an integer user
ID, but it does not have a primary key. Because of the way user IDs
were assigned, lower-value IDs have more records in the table than
larger IDs, which makes parallel imports extremely unbalanced (I'm only
running 7 map tasks).


In order to balance the parallel import, I created an additional column,
'mapid = UserID mod 7', whose values 0 - 6 do have a uniform
distribution of records. When I run the Sqoop import with
'--split-by mapid -m 7', the job seems to be limited to 6 map tasks.
The same behavior occurs even if I add 1 to my 'mapid' column, so I
think Sqoop is limiting the number of map tasks to the difference
between the minimum and maximum values of the split-by column, without
adding 1 to the range.
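
The suspected behavior can be modeled with a short Python sketch. This
is a simplified model under stated assumptions, not Sqoop's source code:
the function name integer_split_ranges is hypothetical, and the rules
(integer step of at least 1, boundary points from min to max, last
interval closed on both ends) are assumptions chosen to match the
observation above.

```python
def integer_split_ranges(min_val, max_val, num_splits):
    """Simplified, assumed model of splitting an integer column range."""
    # Assumption: the step is an integer and is never smaller than 1.
    step = max((max_val - min_val) // num_splits, 1)
    # Boundary points run from min to max and always end exactly at max.
    points = list(range(min_val, max_val + 1, step))
    if points[-1] != max_val or len(points) == 1:
        points.append(max_val)
    # Each adjacent pair of points is one map task; in this model the
    # last pair also covers its upper bound.
    return list(zip(points[:-1], points[1:]))


# mapid values 0 - 6 with -m 7
ranges = integer_split_ranges(0, 6, 7)
print(len(ranges), ranges)
```

In this model, 7 distinct values produce 7 boundary points and therefore
only 6 intervals, with the last interval (5, 6) carrying two 'mapid'
values. Shifting the column by 1 gives the same count, since only the
difference between the maximum and minimum matters.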

I know that I can create a different 'mapid' column or create a
primary key, but is there something I can do with Sqoop to correct for
this?

Thank you,
Erik
