Re: Re: the confusion of --split-by parameter

Abraham Elmahrek Tue, 09 Sep 2014 19:00:45 -0700

Good point. The only thing I can think of is that offsets might be slower
(since the DB has to scan and keep a count internally) and the expectation
that certain ranges of data end up in certain files (though I doubt this
one). I'll defer this one to the broader community as I'm not sure myself.
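
For reference, here is a minimal sketch of the limit/offset idea discussed in this thread. It is not Sqoop's code: the function names, `my_table`, and the `id` ordering column are made up for illustration. It only shows how a row count could be turned into evenly sized chunks, and it assumes MySQL's "LIMIT offset, count" form. The ORDER BY is my assumption too, since without a stable ordering the chunks are not guaranteed to be disjoint.

```python
def offset_splits(total_rows, num_mappers):
    """Divide total_rows into num_mappers contiguous (offset, limit) chunks."""
    base, extra = divmod(total_rows, num_mappers)
    splits = []
    offset = 0
    for i in range(num_mappers):
        # The first `extra` chunks take one additional row each.
        limit = base + (1 if i < extra else 0)
        splits.append((offset, limit))
        offset += limit
    return splits


def mysql_queries(table, order_col, total_rows, num_mappers):
    """Build one query per mapper using MySQL's "LIMIT offset, count" syntax."""
    return [
        f"SELECT * FROM {table} ORDER BY {order_col} LIMIT {offset}, {limit}"
        for offset, limit in offset_splits(total_rows, num_mappers)
    ]


if __name__ == "__main__":
    # Hypothetical table with 10 rows split across 3 mappers.
    for query in mysql_queries("my_table", "id", 10, 3):
        print(query)
```

Every chunk gets nearly the same number of rows, which is the balance the proposal is after. The cost is that the database usually has to walk past `offset` rows before returning anything, so later chunks may get slower.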


On Tue, Sep 9, 2014 at 5:31 PM, [email protected] <
[email protected]> wrote:

>
>
> Hey, brother.
>   Glad to hear from you! I think we can use limit/offset (if the database
> supports this operation), or we can use a subquery (if the database does
> not support limit/offset).
> For example:
> For MySQL: select * from table limit 0,5; select * from table limit 5,5; ...
> For Oracle we can use rownum.
> I just cannot understand why sqoop overrides the operation above. This
> override can lead to data skew.
>
>
> *From:* Abraham Elmahrek <[email protected]>
> *Date:* 2014-09-10 00:38
> *To:* [email protected]
> *Subject:* Re: the confusion of --split-by parameter
> Hey there,
>
> For databases, there needs to be a way to actually infer boundaries for a
> particular column. Simply performing a "select *" would not be enough
> because we would not know how to query the database.
>
> -Abe
>
> On Mon, Sep 8, 2014 at 8:33 PM, [email protected] <
> [email protected]> wrote:
>
>> Hi, all.
>>    In sqoop we can specify the --split-by parameter, which determines
>> which field is used to split the records among the maps.
>> But if the split field's data is skewed, the workload between maps will be
>> imbalanced. I want to know why sqoop does not use
>> select count(*) from table / num-maps to determine each map's workload. As
>> far as I know, a base class of DataDrivenDBInputFormat implements
>> select count(*) from table / num-maps. So why does sqoop override this?
>>
>>
>>
>
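
The skew concern raised in the original question can be illustrated with a small sketch. It assumes the split ranges are derived only from the minimum and maximum of the split column and cut into equal-width pieces. This is a simplification for illustration, not Sqoop's actual implementation, and the function names and sample values are hypothetical.

```python
def range_splits(lo, hi, num_mappers):
    """Cut the integer range [lo, hi] into num_mappers equal-width [start, end) ranges."""
    width = (hi - lo + 1) / num_mappers
    bounds = [lo + round(i * width) for i in range(num_mappers)] + [hi + 1]
    return list(zip(bounds[:-1], bounds[1:]))


def rows_per_split(values, splits):
    """Count how many values fall into each [start, end) range."""
    return [sum(1 for v in values if start <= v < end) for start, end in splits]


if __name__ == "__main__":
    # Hypothetical split-column values: mostly small ids plus one outlier.
    values = [1, 2, 3, 4, 5, 6, 7, 8, 1000]
    splits = range_splits(min(values), max(values), 4)
    print(splits)
    print(rows_per_split(values, splits))
```

With these sample values, one range receives eight of the nine rows and two ranges receive none, which is the kind of imbalance described above. The ranges can be turned into simple WHERE clauses on the split column, though, which is one reason boundaries are useful.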
