Re: Strange distribution of keys among mappers

Abraham Elmahrek Wed, 19 Jun 2013 16:32:48 -0700

David,

I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track the
documentation issue. Thanks for bringing this to the community's attention!


-Abe


On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <[email protected]> wrote:

> Hey David,
>
> With Oracle, the BigDecimalSplitter will be used by default for all number
> types.
>
> -Abe
>
>
> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <[email protected]> wrote:
>
>> Abe, the database is Oracle.
>>
>>
>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <[email protected]> wrote:
>>
>>> David,
>>>
>>> What database are you importing from? The description I gave was for
>>> datatypes that map to the BigDecimalSplitter. The user guide might be
>>> referring to the IntegerSplitter, which will add the remainder to the
>>> last value.
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <[email protected]> wrote:
>>>
>>>> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
>>>> understand your explanation, but it seems to conflict with the Sqoop user
>>>> guide (
>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>> ):
>>>>
>>>> "When performing parallel imports, Sqoop needs a criterion by which it
>>>> can split the workload. Sqoop uses a *splitting column* to split the
>>>> workload. By default, Sqoop will identify the primary key column (if
>>>> present) in a table and use it as the splitting column. The low and high
>>>> values for the splitting column are retrieved from the database, and the
>>>> map tasks operate on evenly-sized components of the total range. For
>>>> example, if you had a table with a primary key column of id whose
>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id <
>>>> hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750,
>>>> 1001) in the different tasks."
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <[email protected]> wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> Here's the algorithm:
>>>>> Split lengths are defined by (max - min)/(# mappers), and whatever is
>>>>> left is tacked on at the end. In this case, (288272191 - 2110)/3 =
>>>>> 96090027 exactly, so split lengths will be 96090027. Sqoop will then
>>>>> create splits with the following points: (min) + (split length)*(n).
>>>>> We can see that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137,
>>>>> 2110 + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will
>>>>> be generated by this algorithm. The last point to be added will be
>>>>> 288272192 because the max value is not part of the generated split
>>>>> points. Sqoop will then distribute the keys accordingly based on these
>>>>> points, as you've pointed out above.
>>>>>
>>>>> Just to be sure, did you configure Sqoop to use 3 mappers?
>>>>>
>>>>> Hope this helps,
>>>>> -Abe
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <[email protected]> wrote:
>>>>>
>>>>>> We're seeing a strange thing happen with a Sqoop import job in the way
>>>>>> the key range is distributed among the 4 mappers that are running.
>>>>>> The minimum key value is 2110 and the maximum value is 288272191. One
>>>>>> mapper is getting only one record to import. Here is the distribution
>>>>>> among the mappers:
>>>>>>
>>>>>> [2110, 96092137)
>>>>>> [96092137, 192182164)
>>>>>> [192182164, 288272191)
>>>>>> [288272191, 288272192)
>>>>>>
>>>>>> You can see that the fourth mapper is given a range with only one
>>>>>> value in it. Could someone help me understand what is going on?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Dave
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
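
The two split behaviors discussed in the quoted thread can be sketched in a few lines of Python. This is only an illustration of the descriptions given in the thread and in the quoted user guide passage, not Sqoop's actual source code. The function names `described_ranges` and `user_guide_ranges` are hypothetical, and the divisor of 3 in the first call is taken from the worked arithmetic in the thread, not from any confirmed Sqoop setting.

```python
def described_ranges(min_val, max_val, divisor):
    """Ranges as described in the thread (hypothetical sketch).

    Split points are min + split_len * n for n = 0..divisor, and one more
    point (max + 1) is appended so the max value itself is covered.
    """
    split_len = (max_val - min_val) // divisor
    points = [min_val + split_len * n for n in range(divisor + 1)]
    points.append(max_val + 1)
    # Pair consecutive points into half-open ranges [lo, hi).
    return list(zip(points, points[1:]))


def user_guide_ranges(min_val, max_val, num_tasks):
    """Ranges as in the user guide example (hypothetical sketch).

    Each task gets an evenly sized range, and the last range is extended
    to max + 1 so it absorbs any remainder.
    """
    split_len = (max_val - min_val) // num_tasks
    points = [min_val + split_len * n for n in range(num_tasks)]
    points.append(max_val + 1)
    return list(zip(points, points[1:]))


if __name__ == "__main__":
    for lo, hi in described_ranges(2110, 288272191, 3):
        print(f"[{lo}, {hi})")
    for lo, hi in user_guide_ranges(0, 1000, 4):
        print(f"({lo}, {hi})")
```

With the values from the thread, the first function yields the four ranges David reported, including the final range that holds a single key. The second yields the (0, 250), (250, 500), (500, 750), (750, 1001) ranges from the user guide example.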
