Here is a workaround for your scenario. Add this option to your sqoop command:

    --boundary-query 'select min(mapid), max(mapid) + 1 from <table_name>'

It raises the upper bound of the split range by one, so the range 0 - 6 is treated as 0 - 7.
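To illustrate why raising the upper bound by one can help, here is a rough Python model of an integer range splitter that shows the off-by-one behaviour Erik describes. It is only a sketch under assumptions: the function names are hypothetical, and the logic is a simplified approximation rather than Sqoop's actual source.

```python
def split_points(num_splits, min_val, max_val):
    """Boundary points for an integer range (simplified, assumed model)."""
    # Integer step between boundaries, never smaller than 1.
    step = max((max_val - min_val) // num_splits, 1)
    points = list(range(min_val, max_val + 1, step))
    # Make sure the final boundary is the upper bound itself.
    if points[-1] != max_val or len(points) == 1:
        points.append(max_val)
    return points


def count_splits(num_splits, min_val, max_val):
    """Each pair of adjacent boundary points becomes one map task."""
    return len(split_points(num_splits, min_val, max_val)) - 1


print(count_splits(7, 0, 6))  # default bounds min(mapid)=0, max(mapid)=6 -> 6
print(count_splits(7, 0, 7))  # upper bound max(mapid) + 1 = 7 -> 7
```

In this model, shifting mapid to 1 - 7 still gives 6 splits, because only the width of the range matters. Widening the range by one yields the seventh split, and no rows are lost because no row has mapid = 7.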
Let me know if that doesn't work.

Thanks,
Abhijeet

On 30 Aug 2012 21:43, "Erik Knoll" <[email protected]> wrote:
> I'm using Sqoop 1.4.1 to import a table from MySQL to HDFS. The table
> contains log entries by users who are identified by an integer user ID
> but does not have a primary key. Because of the way user IDs were
> assigned, lower value IDs have more records in the table than larger
> IDs, making parallel imports extremely unbalanced (I'm only running 7
> map tasks).
>
> In order to balance the parallel import, I created an additional column
> which I set to be 'mapid = UserID mod 7', producing values 0 - 6 which
> do have a uniform distribution of records. When I run the Sqoop import
> with '--split-by mapid -m 7' the job seems to be limited to 6 map
> tasks. This same behavior is exhibited even if I add 1 to my 'mapid'
> column, so I'm thinking Sqoop is limiting the map tasks to the
> difference between the minimum and maximum values of the split-by
> column without adding 1 to the range.
>
> I know that I can create a different 'mapid' column or create a
> primary key, but is there something I can do with Sqoop to correct for
> this?
>
> Thank you,
> Erik
>
