So the number of mappers depends on a few factors (assuming a Sqoop import, in 
response to your question):

1) The number of data nodes - Sqoop will take your -m# switch and send the 
generated .jar file to that same # of data nodes. So if you have 5 data nodes 
and you set your Sqoop execution to -m30 then you might overrun your Hadoop 
cluster. 
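
For example, a minimal import sketch (the connection string, user, and table 
name here are made-up placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl -P \
      --table orders \
      -m 4

Here -m 4 asks for 4 parallel map tasks, so 4 concurrent copies of the import 
query will hit the source database at once.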

2) How many parallel SQL queries your source RDBMS can handle - Again, sending 
a switch of -m30 may completely paralyze a source RDBMS because of the 
concurrent load you are requesting. 
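
One quick sanity check (a sketch, assuming a MySQL source; other RDBMSs have 
their own equivalents) is to compare your -m value against what the database 
will actually tolerate:

    -- total connections the server will allow
    SHOW VARIABLES LIKE 'max_connections';
    -- sessions already in flight right now
    SHOW PROCESSLIST;

If the headroom between those two numbers is small, a large -m will queue up 
or starve the database's other work.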

3) Skew of data by PK - Sqoop will take the # of mappers (based on the -m 
switch) and split the range between the MIN and MAX of the --split-by column 
into that many even intervals (unless you do something special with the 
--boundary-query switch). So for example -m4 may skew the data badly while even 
a small change to -m5 or -m6 may have the skew looking much better. You can 
test out the skew by running similar counts against the source RDBMS or by 
looking at the data files Sqoop creates. 
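
As a sketch of that kind of skew check (the table name orders and split column 
id are made-up placeholders), you can mimic Sqoop's even-interval split in SQL 
and count the rows that would land in each of, say, 4 splits:

    SELECT FLOOR((o.id - b.mn) / ((b.mx - b.mn + 1) / 4.0)) AS split_no,
           COUNT(*)                                         AS rows_in_split
    FROM orders o,
         (SELECT MIN(id) AS mn, MAX(id) AS mx FROM orders) b
    GROUP BY 1
    ORDER BY 1;

If the counts come back badly uneven, try a different -m value, supply tighter 
MIN/MAX bounds through --boundary-query, or just compare the sizes of the 
part-m-* files Sqoop writes (hadoop fs -ls on the target directory) after a 
test run.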

I hope that helps. 

Thanks,
Brett

> On Jun 21, 2015, at 5:39 AM, sreejesh s <[email protected]> wrote:
> 
> Hi,
> 
> 
> If there is a primary key on the source table, a Sqoop import would generate 
> no skewed data... But what if there is no primary key defined on the table 
> and we have to use the --split-by parameter to split records among multiple 
> mappers?
>  
> There is a high chance of skewed data depending on the column we select for 
> --split-by.
>  
> Could you please help me understand how to avoid skew in such scenarios and 
> also how to determine the optimal number of mappers to use for any Sqoop 
> import.
> 
> It would help if you could explain how many mappers you have used in your 
> use case along with the size and format of the data imported. 
>  
> Thanks
> 
