[jira] [Updated] (HIVE-12464) Inconsistent behavior between MapReduce and Spark engine on bucketed mapjoin

Nemon Lou (JIRA) Thu, 19 Nov 2015 01:22:32 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nemon Lou updated HIVE-12464:
-----------------------------
    Description: 
Steps to reproduce:
1,prepare the table and data
{noformat}
create table if not exists lxw_test(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table if not exists lxw_test1(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
insert overwrite table lxw_test
values(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9),(10,10,10);
insert overwrite table lxw_test1
values 
(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9),(10,10,10);
set hive.enforce.bucketing;
insert into table lxw_test1 select * from lxw_test;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
{noformat}
2,the following sql will success :
{noformat}
set hive.execution.engine=mr;
explain select  count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei ;
{noformat}
3,this one will fail :
{noformat}
set hive.execution.engine=spark;
explain select  count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei ;
{noformat}
On spark,the query returns this error:
{noformat}
Error: Error while compiling statement: FAILED: SemanticException [Error 
10141]: Bucketed table metadata is not correct. Fix the metadata or don't use 
bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of 
buckets for table lxw_test1 is 5, whereas the number of files is 10 
(state=42000,code=10141)
{noformat}
After set hive.ignore.mapjoin.hint=false and use mapjoin hint,the MapReduce 
engine return the same error.
{noformat}
set hive.execution.engine=mr;
set hive.ignore.mapjoin.hint=false;
explain
select /*+ mapjoin(b) */ count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei ;
{noformat}


  was:
Steps to reproduce:
1,prepare the table and data
{noformat}
create table if not exists lxw_test(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table if not exists lxw_test1(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
insert overwrite table lxw_test
values(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9),(10,10,10);
insert overwrite table lxw_test1
values 
(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9),(10,10,10);
set hive.enforce.bucketing;
insert into table lxw_test1 select * from lxw_test;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
{noformat}
2,the following sql will success :
{noformat}
set hive.execution.engine=mr;
select  count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei ;
{noformat}
3,this one will fail :
{noformat}
set hive.execution.engine=spark;
select  count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei ;
{noformat}
On spark,the query returns this error:
{noformat}
Error: Error while compiling statement: FAILED: SemanticException [Error 
10141]: Bucketed table metadata is not correct. Fix the metadata or don't use 
bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number of 
buckets for table lxw_test1 is 5, whereas the number of files is 10 
(state=42000,code=10141)
{noformat}
After set hive.ignore.mapjoin.hint=false and use mapjoin hint,the MapReduce 
engine return the same error.
{noformat}
set hive.execution.engine=mr;
set hive.ignore.mapjoin.hint=false;
explain
select /*+ mapjoin(b) */ count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei ;
{noformat}



> Inconsistent behavior between MapReduce and Spark engine on bucketed mapjoin
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-12464
>                 URL: https://issues.apache.org/jira/browse/HIVE-12464
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning, Spark
>    Affects Versions: 1.2.1
>            Reporter: Nemon Lou
>
> Steps to reproduce:
> 1,prepare the table and data
> {noformat}
> create table if not exists lxw_test(imei string,sndaid string,data_time 
> string)
> CLUSTERED BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
> create table if not exists lxw_test1(imei string,sndaid string,data_time 
> string)
> CLUSTERED BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
> set hive.enforce.bucketing = true;
> set hive.enforce.sorting = true;
> insert overwrite table lxw_test
> values(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9),(10,10,10);
> insert overwrite table lxw_test1
> values 
> (1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5),(6,6,6),(7,7,7),(8,8,8),(9,9,9),(10,10,10);
> set hive.enforce.bucketing;
> insert into table lxw_test1 select * from lxw_test;
> set hive.optimize.bucketmapjoin = true;
> set hive.optimize.bucketmapjoin.sortedmerge = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> {noformat}
> 2,the following sql will success :
> {noformat}
> set hive.execution.engine=mr;
> explain select  count(1) 
> from lxw_test1 a 
> join lxw_test b 
> on a.imei = b.imei ;
> {noformat}
> 3,this one will fail :
> {noformat}
> set hive.execution.engine=spark;
> explain select  count(1) 
> from lxw_test1 a 
> join lxw_test b 
> on a.imei = b.imei ;
> {noformat}
> On spark,the query returns this error:
> {noformat}
> Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10141]: Bucketed table metadata is not correct. Fix the metadata or don't use 
> bucketed mapjoin, by setting hive.enforce.bucketmapjoin to false. The number 
> of buckets for table lxw_test1 is 5, whereas the number of files is 10 
> (state=42000,code=10141)
> {noformat}
> After set hive.ignore.mapjoin.hint=false and use mapjoin hint,the MapReduce 
> engine return the same error.
> {noformat}
> set hive.execution.engine=mr;
> set hive.ignore.mapjoin.hint=false;
> explain
> select /*+ mapjoin(b) */ count(1) 
> from lxw_test1 a 
> join lxw_test b 
> on a.imei = b.imei ;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-12464) Inconsistent behavior between MapReduce and Spark engine on bucketed mapjoin

Reply via email to