So the number of mappers depends on a few factors (assuming a Sqoop import, as in your question):
1) The number of data nodes - Sqoop will take your -m# switch and send the generated .jar file to that same number of data nodes. So if you have 5 data nodes and you set your Sqoop execution to -m30, you might overrun your Hadoop cluster.

2) How many parallel SQL queries your source RDBMS can handle - Again, sending a switch of -m30 may completely paralyze the source RDBMS because of the concurrent load you are requesting.

3) Skew of data by PK - Sqoop will take the number of mappers (based on the -m switch) and divide the range between the MIN and MAX of the --split-by column into that many even intervals (unless you do something special with the --boundary-query switch). So, for example, -m4 may skew the data badly, while even a small change to -m5 or -m6 may leave the skew looking much better. You can test for skew by running similar counts against the source RDBMS or by looking at the data files Sqoop creates; I've put a few sketches below.

I hope that helps.

Thanks,
Brett
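To make the points above concrete, here is a minimal import sketch. This is only an illustration, not from a real job: the connection string, credentials, table, and column (dbhost, sales, orders, order_id) are all hypothetical.

  # 4 mappers splitting on order_id; with 5 data nodes this stays
  # under the cluster's capacity (point 1) and keeps the number of
  # concurrent queries against the source modest (point 2).
  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl_user -P \
    --table orders \
    --split-by order_id \
    -m 4 \
    --target-dir /data/orders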
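If the default MIN/MAX boundaries produce bad splits (say, a handful of outlier keys stretches the range), --boundary-query lets you supply your own bounds. The numbers below are made up for illustration:

  # Override the MIN/MAX that Sqoop would otherwise compute for
  # order_id; each of the 6 mappers gets an even slice of this range.
  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl_user -P \
    --table orders \
    --split-by order_id \
    --boundary-query "SELECT 1000000, 2000000" \
    -m 6 \
    --target-dir /data/orders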
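And for testing skew, something along these lines: count rows per split range on the source (the bucketing below assumes -m4 over keys 1 to 4,000,000), then compare the sizes of the part files Sqoop wrote. Both checks are sketches with the same hypothetical names as above:

  # Rough per-split row counts against the source; a lopsided result
  # here means the import will be lopsided too.
  mysql -h dbhost -u etl_user -p sales -e "
    SELECT FLOOR((order_id - 1) / 1000000) AS split_no,
           COUNT(*)                        AS rows_in_split
    FROM orders
    GROUP BY split_no;"

  # After the import, uneven part-m-* file sizes are the same signal.
  hadoop fs -ls /data/orders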

> On Jun 21, 2015, at 5:39 AM, sreejesh s <[email protected]> wrote:
>
> Hi,
>
> If there is a primary key on the source table, a Sqoop import would not
> generate skewed data... What if there is no primary key defined on the table
> and we have to use the --split-by parameter to split records among multiple
> mappers? There is a high chance of skewed data depending on the column we
> select for --split-by.
>
> Could you please help me understand how to avoid skew in such scenarios, and
> also how to determine the optimal number of mappers to use for any Sqoop
> import.
>
> It would help if you could explain how many mappers you have used in your own
> use case, along with the size and format of the data imported.
>
> Thanks
