[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106407#comment-14106407
 ] 

Lianhui Wang commented on HIVE-7384:


I think this is the same as the idea you described before. Like HIVE-7158, 
it would auto-calculate the number of reducers based on some input from Hive 
(upper/lower bounds).
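A minimal sketch of the bound-based estimation mentioned here, in the spirit of HIVE-7158 (the helper name and constants are illustrative, not actual Hive configuration values): divide the input size by a target bytes-per-reducer and clamp the result between a lower and an upper bound.

```java
// Sketch of auto-calculating reducer parallelism from input size with
// upper/lower bounds. Names and constants are illustrative only.
public class ReducerEstimateSketch {

    // reducers = clamp(ceil(inputBytes / bytesPerReducer), min, max)
    public static int estimateReducers(long inputBytes, long bytesPerReducer,
                                       int minReducers, int maxReducers) {
        long wanted = (inputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        return (int) Math.min(maxReducers, Math.max(minReducers, wanted));
    }

    public static void main(String[] args) {
        // 10 GiB of input at 256 MiB per reducer, bounded by [1, 999]
        System.out.println(estimateReducers(10L * 1024 * 1024 * 1024,
                                            256L * 1024 * 1024, 1, 999));
    }
}
```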

> Research into reduce-side join [Spark Branch]
> -
>
> Key: HIVE-7384
> URL: https://issues.apache.org/jira/browse/HIVE-7384
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Szehon Ho
> Attachments: Hive on Spark Reduce Side Join.docx, sales_items.txt, 
> sales_products.txt, sales_stores.txt
>
>
> Hive's join operator is very sophisticated, especially for reduce-side join. 
> While we expect that other types of join, such as map-side join and SMB 
> map-side join, will work out of the box with our design, there may be some 
> complications in reduce-side join, which extensively utilizes key tags and 
> shuffle behavior. Our design principle prefers making the Hive implementation 
> work out of the box as well, which might require new functionality from Spark. 
> The task is to research this area, identifying requirements for the Spark 
> community and the work to be done on Hive to make reduce-side join work.
> A design doc might be needed for this. For more information, please refer to 
> the overall design doc on the wiki.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106343#comment-14106343
 ] 

Lianhui Wang commented on HIVE-7384:


@Szehon Ho Yes, I read the OrderedRDDFunctions code and discovered that sortByKey 
actually does a range partition. We need to replace the range partition with a 
hash partition, so Spark should probably provide a new interface, for example 
partitionSortByKey.
@Brock Noland The code in 1) means that when Hive samples data and uses more than 
one reducer, it does a total-order sort. Since a join does not sample data, it 
does not need a total-order sort.
2) I think we really need auto-parallelism. As I discussed with Reynold Xin 
before, Spark needs to support re-partitioning the map output's data the same 
way Tez does.
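To make the partitionSortByKey idea above concrete, here is a minimal, dependency-free Java sketch (the method name and shape are illustrative, not an actual Spark API): records are assigned to reducers by hashing the key rather than by a sampled range, and each partition is then sorted independently, so no total order across partitions is produced.

```java
import java.util.*;

// Sketch of "partitionSortByKey": hash-partition by key, then sort
// within each partition only. Illustrative, not a real Spark interface.
public class PartitionSortSketch {

    // Hash keys into numPartitions buckets, then sort each bucket.
    public static List<List<String>> partitionSortByKey(List<String> keys,
                                                        int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (String k : keys) {
            int p = (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
            parts.get(p).add(k);            // equal keys land in the same partition
        }
        for (List<String> part : parts) Collections.sort(part); // sort within partition
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(partitionSortByKey(Arrays.asList("b", "a", "c", "a"), 2));
    }
}
```

Unlike sortByKey's range partitioning, this needs no sampling pass over the data, which is the property the comment argues a reduce-side join needs.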



[jira] [Commented] (HIVE-7384) Research into reduce-side join [Spark Branch]

2014-08-21 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105301#comment-14105301
 ] 

Lianhui Wang commented on HIVE-7384:


I think current Spark already supports hashing by join_col and sorting by 
{join_col, tag}, because Spark's map-side shuffle writer hash-partitions by 
Key.hashCode and sorts by Key, and Hive's HiveKey class already defines that 
hashCode. So it can support hashing by HiveKey.hashCode and sorting by 
HiveKey's bytes.
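As an illustration of hashing over the join columns while comparing the full bytes, here is a hypothetical binary-key sketch (not the real HiveKey class): hashCode covers only the serialized join columns, so hash partitioning co-locates rows with the same join key, while compareTo orders by the full key bytes, so rows sort by {join_col, tag} within a reducer.

```java
import java.util.*;

// Illustrative binary key: hash over join columns, compare full bytes.
public class BinaryKeySketch implements Comparable<BinaryKeySketch> {
    final byte[] bytes;   // serialized join columns followed by the table tag
    final int hash;       // hash over the join columns only

    public BinaryKeySketch(byte[] joinCols, byte tag) {
        bytes = Arrays.copyOf(joinCols, joinCols.length + 1);
        bytes[joinCols.length] = tag;
        hash = Arrays.hashCode(joinCols);   // tag excluded from the hash
    }

    @Override public int hashCode() { return hash; }

    // Lexicographic comparison of raw bytes: join key first, then tag,
    // which is exactly the reduce-side join ordering.
    @Override public int compareTo(BinaryKeySketch o) {
        int n = Math.min(bytes.length, o.bytes.length);
        for (int i = 0; i < n; i++) {
            int c = (bytes[i] & 0xff) - (o.bytes[i] & 0xff);
            if (c != 0) return c;
        }
        return bytes.length - o.bytes.length;
    }

    public static void main(String[] args) {
        BinaryKeySketch k0 = new BinaryKeySketch(new byte[]{1, 2}, (byte) 0);
        BinaryKeySketch k1 = new BinaryKeySketch(new byte[]{1, 2}, (byte) 1);
        System.out.println(k0.hashCode() == k1.hashCode()); // same partition
        System.out.println(k0.compareTo(k1) < 0);           // ordered by tag
    }
}
```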



Re: Hive insert overwrite strange behavior

2014-07-20 Thread Lianhui Wang
The operator plans of the two SQL statements are different.
First one:
TableScanOperator--SelectOperator--ReduceOutputOperator--FileSinkOperator--MoveOperator
Second one:
TableScanOperator--SelectOperator--FetchOperator
In the second one, the FetchOperator runs on the client and writes directly to
the local directory. In the first one, the result is written to a temporary HDFS
directory and then moved from HDFS to the local directory.
You can prepend EXPLAIN to the SQL and look at its operator plan. Example:
explain insert overwrite local directory 'output' select * from test limit
10;


2014-07-16 11:36 GMT+08:00 Azuryy Yu :

> Hi,
>
> I think the following two sql have the same effect.
>
> 1) hive -e "insert overwrite local directory 'output' select * from test
> limit 10;"
> 2) hive -e "select * from test limit 10;" > output
>
>
> but the second one read HDFS directly only takes two seconds, but the first
> one submit a MR job, which has one reduce.
>
> why there is such difference? Thanks.
>



-- 
thanks

王联辉(Lianhui Wang)
blog: http://blog.csdn.net/lance_123
Interests: databases, distributed systems, data mining, programming languages, Internet technologies, etc.


Re: [ANNOUNCE] New Hive PMC Member - Xuefu Zhang

2014-02-28 Thread Lianhui Wang
Congrats Xuefu!


2014-02-28 17:49 GMT+08:00 Jason Dere :

> Congrats Xuefu!
>
>
> On Feb 28, 2014, at 1:43 AM, Biswajit Nayak 
> wrote:
>
> > Congrats Xuefu..
> >
> > With Best Regards
> > Biswajit
> >
> > ~Biswa
> > -oThe important thing is not to stop questioning o-
> >
> >
> > On Fri, Feb 28, 2014 at 2:50 PM, Carl Steinbach  wrote:
> > I am pleased to announce that Xuefu Zhang has been elected to the Hive
> Project Management Committee. Please join me in congratulating Xuefu!
> >
> > Thanks.
> >
> > Carl
> >
> >
> >
> > _
> > The information contained in this communication is intended solely for
> the use of the individual or entity to whom it is addressed and others
> authorized to receive it. It may contain confidential or legally privileged
> information. If you are not the intended recipient you are hereby notified
> that any disclosure, copying, distribution or taking any action in reliance
> on the contents of this information is strictly prohibited and may be
> unlawful. If you have received this communication in error, please notify
> us immediately by responding to this email and then delete it from your
> system. The firm is neither liable for the proper and complete transmission
> of the information contained in this communication nor for any delay in its
> receipt.
>
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>





Re: [ANNOUNCE] New Hive Committer - Xuefu Zhang

2013-11-04 Thread Lianhui Wang
Congrats Xuefu!


2013/11/4 Thejas Nair 

> Congrats Xuefu!
>
> On Sun, Nov 3, 2013 at 11:11 PM, Mohammad Islam 
> wrote:
> > Congrats Xuefu!
> >
> > --Mohammad
> >
> >
> > On Sunday, November 3, 2013 9:07 PM, "hsubramani...@hortonworks.com"
> >  wrote:
> > Congrats Xuefu!
> >
> > Thanks,
> > Hari
> >
> > On Nov 3, 2013, at 8:28 PM, Gunther Hagleitner <
> ghagleit...@hortonworks.com>
> > wrote:
> >
> > Congrats Xuefu!
> >
> > Gunther.
> >
> >
> > On Sun, Nov 3, 2013 at 8:23 PM, Lefty Leverenz
> > wrote:
> >
> > Bravo Xuefu!
> >
> >
> > -- Lefty
> >
> >
> >
> > On Sun, Nov 3, 2013 at 11:09 PM, Zhang Xiaoyu  >
> > wrote:
> >
> >
> > Congratulations! Xuefu, well deserved!
> >
> >
> > Johnny
> >
> >
> >
> > On Sun, Nov 3, 2013 at 8:06 PM, Carl Steinbach  >
> > wrote:
> >
> >
> > The Apache Hive PMC has voted to make Xuefu Zhang a committer on the
> >
> > Apache Hive project.
> >
> >
> > Please join me in congratulating Xuefu!
> >
> >
> > Thanks.
> >
> >
> > Carl
> >
> >
> >
> >
> >
> >
>





Re: [ANNOUNCE] New Hive PMC Members - Thejas Nair and Brock Noland

2013-10-24 Thread Lianhui Wang
Congrats Thejas and Brock!


2013/10/25 Navis류승우 

> Congrats!
>
>
> 2013/10/25 Gunther Hagleitner 
>
> > Congrats Thejas and Brock!
> >
> > Thanks,
> > Gunther.
> >
> >
> > On Thu, Oct 24, 2013 at 3:25 PM, Prasad Mujumdar  > >wrote:
> >
> > >
> > >Congratulations Thejas and Brock !
> > >
> > > thanks
> > > Prasad
> > >
> > >
> > >
> > > On Thu, Oct 24, 2013 at 3:10 PM, Carl Steinbach 
> wrote:
> > >
> > >> I am pleased to announce that Thejas Nair and Brock Noland have been
> > >> elected to the Hive Project Management Committee. Please join me in
> > >> congratulating Thejas and Brock!
> > >>
> > >> Thanks.
> > >>
> > >> Carl
> > >>
> > >
> > >
> >
>





Re: [ANNOUNCE] New Hive Committer - Yin Huai

2013-09-04 Thread Lianhui Wang
Congrats Yin!


2013/9/4 Thejas Nair 

> Congrats Yin!
> Well deserved! Looking forward to many more contributions from you!
>
>
>
> On Tue, Sep 3, 2013 at 11:45 PM, Hari Subramaniyan
>  wrote:
> > Congrats !!
> >
> >
> > On Tue, Sep 3, 2013 at 11:43 PM, Vaibhav Gumashta
> >  wrote:
> >>
> >> Congrats Yin!
> >>
> >>
> >> On Tue, Sep 3, 2013 at 11:37 PM, Jarek Jarcec Cecho 
> >> wrote:
> >>>
> >>> Congratulations Yin!
> >>>
> >>> Jarcec
> >>>
> >>> On Tue, Sep 03, 2013 at 09:49:55PM -0700, Carl Steinbach wrote:
> >>> > The Apache Hive PMC has voted to make Yin Huai a committer on the
> >>> > Apache
> >>> > Hive project.
> >>> >
> >>> > Please join me in congratulating Yin!
> >>> >
> >>> > Thanks.
> >>> >
> >>> > Carl
> >>
> >>
> >>
>





[jira] [Commented] (HIVE-3430) group by followed by join with the same key should be optimized

2013-07-23 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717883#comment-13717883
 ] 

Lianhui Wang commented on HIVE-3430:


Yin Huai,very nice work!

> group by followed by join with the same key should be optimized
> ---
>
> Key: HIVE-3430
> URL: https://issues.apache.org/jira/browse/HIVE-3430
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 0.10.0
>Reporter: Namit Jain
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4506) use one map reduce to join multiple small tables

2013-05-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13650380#comment-13650380
 ] 

Lianhui Wang commented on HIVE-4506:


If the joins use different columns, HIVE-3784 addressed joining one big table 
with multiple small tables.

> use one map reduce to join multiple small tables 
> -
>
> Key: HIVE-4506
> URL: https://issues.apache.org/jira/browse/HIVE-4506
> Project: Hive
>  Issue Type: Wish
>Affects Versions: 0.10.0
>Reporter: Fern
>Priority: Minor
>
> I know we can use map side join for small table.
> by my test, if I use HQL like this
> --
> select /*+mapjoin(b,c)*/...
> from a
> left join b
> on ...
> left join c
> on ...
> ---
> b and c are both small tables, I expect do the join in one map reduce using 
> map side join. Actually, it would generate two map-reduce jobs by sequence.
> Sorry, currently I am just a user of hive and not dig into the code, so this 
> is what I expect but I have no idea about how to improve now. 



[jira] [Commented] (HIVE-4506) use one map reduce to join multiple small tables

2013-05-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13650371#comment-13650371
 ] 

Lianhui Wang commented on HIVE-4506:


Fern, can you provide your SQL?
If these tables use the same column in the join clause, it runs as one MR job.
Example:
explain
SELECT /*+mapjoin(src2,src3)*/ src1.key, src3.value FROM src src1 JOIN src src2 
ON (src1.key = src2.key) JOIN src src3 ON (src1.key = src3.key);





[jira] [Commented] (HIVE-4429) Nested ORDER BY produces incorrect result

2013-04-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643512#comment-13643512
 ] 

Lianhui Wang commented on HIVE-4429:


Mihir Kulkarni,
sorry, my mistake.
Navis,
yes, there are some bugs in reduce-sink deduplication; I cannot confirm that 
they are fixed, but I think the inner ORDER BY can be removed in this case.

> Nested ORDER BY produces incorrect result
> -
>
> Key: HIVE-4429
> URL: https://issues.apache.org/jira/browse/HIVE-4429
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, SQL, UDF
>Affects Versions: 0.9.0
> Environment: Red Hat Linux VM with Hive 0.9 and Hadoop 2.0
>Reporter: Mihir Kulkarni
>Priority: Critical
> Attachments: Hive_Command_Script.txt, HiveQuery.txt, Test_Data.txt
>
>
> Nested ORDER BY clause doesn't honor the outer one in specific case.
> The below query produces result which honors only the inner ORDER BY clause. 
> (it produces only 1 MapRed job)
> {code:borderStyle=solid}
> SELECT alias.b0 as d0, alias.b1 as d1
> FROM
> (SELECT test.a0 as b0, test.a1 as b1 
> FROM test
> ORDER BY b1 ASC, b0 DESC) alias
> ORDER BY d0 ASC, d1 DESC;
> {code}
> 
> On the other hand the query below honors the outer ORDER BY clause which 
> produces the correct result. (it produces 2 MapRed jobs)
> {code:borderStyle=solid}
> SELECT alias.b0 as d0, alias.b1 as d1
> FROM
> (SELECT test.a0 as b0, test.a1 as b1 
> FROM test
> ORDER BY b1 ASC, b0 DESC) alias
> ORDER BY d0 DESC, d1 DESC;
> {code}
> 
> Any other combination of nested ORDER BY clauses does produce the correct 
> result.
> Please see attachments for query, schema and Hive Commands for reprocase.



[jira] [Commented] (HIVE-4429) Nested ORDER BY produces incorrect result

2013-04-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643453#comment-13643453
 ] 

Lianhui Wang commented on HIVE-4429:


Hi, Mihir Kulkarni,
I ran the first SQL of your cases, but on my Hive 0.9 it produces the correct
result, which is the following:
30.0  1.0
20.0  1.0
10.0  1.0
30.0  2.0
20.0  2.0
10.0  2.0
30.0  3.0
20.0  3.0
10.0  3.0
60.0  4.0
50.0  4.0
40.0  4.0
60.0  5.0
50.0  5.0
40.0  5.0
60.0  6.0
50.0  6.0
40.0  6.0

So can you tell me which version you used?





[jira] [Commented] (HIVE-4365) wrong result in left semi join

2013-04-16 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633664#comment-13633664
 ] 

Lianhui Wang commented on HIVE-4365:


Hi, ransom,
the problem also exists in my environment. I used an EXPLAIN statement and found
that the second SQL's predicate pushdown (PPD) is wrong:
TableScan
alias: t2
Filter Operator
  predicate:
  expr: (c1 = 1)
  type: boolean

The PPD optimizer pushes the filter c1 = '1' down to both table t1 and table t2,
but the correct behavior is to push it only to table t1, not t2.


> wrong result in left semi join
> --
>
> Key: HIVE-4365
> URL: https://issues.apache.org/jira/browse/HIVE-4365
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.9.0, 0.10.0
>Reporter: ransom.hezhiqiang
>
> wrong result in left semi join while hive.optimize.ppd=true
> for example:
> 1、create table
>create table t1(c1 int,c2 int, c3 int, c4 int, c5 double,c6 int,c7 string) 
>   row format DELIMITED FIELDS TERMINATED BY '|';
>create table t2(c1 int) ;
> 2、load data
> load data local inpath '/home/test/t1.txt' OVERWRITE into table t1;
> load data local inpath '/home/test/t2.txt' OVERWRITE into table t2;
> t1 data:
> 1|3|10003|52|781.96|555|201203
> 1|3|10003|39|782.96|555|201203
> 1|3|10003|87|783.96|555|201203
> 2|5|10004|24|789.96|555|201203
> 2|5|10004|58|788.96|555|201203
> t2 data:
> 555
> 3、excute Query
> select t1.c1,t1.c2,t1.c3,t1.c4,t1.c5,t1.c6,t1.c7  from t1 left semi join t2 
> on t1.c6 = t2.c1 and  t1.c1 =  '1' and t1.c7 = '201203' ;   
> can got result.
> select t1.c1,t1.c2,t1.c3,t1.c4,t1.c5,t1.c6,t1.c7  from t1 left semi join t2 
> on t1.c6 = t2.c1 where t1.c1 =  '1' and t1.c7 = '201203' ;   
> can't got result.



[jira] [Commented] (HIVE-4137) optimize group by followed by joins for bucketed/sorted tables

2013-03-07 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596690#comment-13596690
 ] 

Lianhui Wang commented on HIVE-4137:


In addition, for bucketed/sorted tables, a single group-by only needs the 
map-side group-by operator and does not need the reduce-side group-by operator.
Example:
select key, aggr() from T1 group by key;
The current plan is
TS-SEL-GBY-RS-GBY-SEL-FS
but it can be changed to the following plan:
TS-SEL-GBY-SEL-FS


> optimize group by followed by joins for bucketed/sorted tables
> --
>
> Key: HIVE-4137
> URL: https://issues.apache.org/jira/browse/HIVE-4137
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>
> Consider the following scenario:
> create table T1 (...) clustered by (key) sorted by (key) into 2 buckets;
> create table T2 (...) clustered by (key) sorted by (key) into 2 buckets;
> create table T3 (...) clustered by (key) sorted by (key) into 2 buckets;
> SET hive.enforce.sorting=true;
> SET hive.enforce.bucketing=true;
> insert overwrite table T3
> select ..
> from 
> (select key, aggr() from T1 group by key) s1
> full outer join
> (select key, aggr() from T2 group by key) s2
> on s1.key=s2.ley;
> Ideally, this query can be performed in a single map-only job.
> Group By -> SortMerge Join.



[jira] [Commented] (HIVE-3963) Allow Hive to connect to RDBMS

2013-03-07 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13596639#comment-13596639
 ] 

Lianhui Wang commented on HIVE-3963:


I think it must support an AS clause, like the TRANSFORM syntax.
For example:
SELECT jdbcload('driver','url','user','password','sql') as c1,c2 FROM dual;

> Allow Hive to connect to RDBMS
> --
>
> Key: HIVE-3963
> URL: https://issues.apache.org/jira/browse/HIVE-3963
> Project: Hive
>  Issue Type: New Feature
>  Components: Import/Export, JDBC, SQL, StorageHandler
>Affects Versions: 0.9.0, 0.10.0, 0.9.1, 0.11.0
>Reporter: Maxime LANCIAUX
>
> I am thinking about something like :
> SELECT jdbcload('driver','url','user','password','sql') FROM dual;
> There is already a JIRA https://issues.apache.org/jira/browse/HIVE-1555 for 
> JDBCStorageHandler



[jira] [Commented] (HIVE-3430) group by followed by join with the same key should be optimized

2013-03-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590321#comment-13590321
 ] 

Lianhui Wang commented on HIVE-3430:


We should also consider the following query:
SELECT a.key, a.cnt, b.key, a.cnt
FROM
(SELECT x.key as key, count(x.value) AS cnt FROM src x group by x.key) a
JOIN src b
ON (a.key = b.key);




[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-27 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589226#comment-13589226
 ] 

Lianhui Wang commented on HIVE-4014:


Hi, Tamas,
thank you very much; you are right.
I also think RCFile.Reader is not very efficient in how the read column ids 
are passed to it.

> Hive+RCFile is not doing column pruning and reading much more data than 
> necessary
> -
>
> Key: HIVE-4014
> URL: https://issues.apache.org/jira/browse/HIVE-4014
> Project: Hive
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>
> With even simple projection queries, I see that HDFS bytes read counter 
> doesn't show any reduction in the amount of data read.



[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary

2013-02-25 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586701#comment-13586701
 ] 

Lianhui Wang commented on HIVE-4014:


I don't think so. I looked at the code:
in HiveInputFormat's and CombineHiveInputFormat's getRecordReader(), 
pushProjectionsAndFilters() is called.
In pushProjectionsAndFilters(), the needed columns are obtained from the 
TableScanOperator and their ids are set in hive.io.file.readcolumn.ids.
Then RCFile.Reader reads hive.io.file.readcolumn.ids to skip columns.
Maybe the counter has some mistakes.
If I am mistaken, please tell me. Thanks.
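The pushdown path described above can be sketched as follows; the helper below is illustrative (it is not Hive's actual ColumnProjectionUtils), showing how a reader can parse a comma-separated hive.io.file.readcolumn.ids value and skip the columns that were not requested.

```java
import java.util.*;

// Sketch of honoring a projection pushed down as a comma-separated
// column-id list. Helper names are illustrative, not Hive's real API.
public class ColumnPruningSketch {

    // Parse "0,2,5" into the set of column ids the reader must load.
    public static Set<Integer> parseReadColumnIds(String prop) {
        Set<Integer> ids = new TreeSet<>();
        if (prop == null || prop.isEmpty()) return ids;
        for (String s : prop.split(",")) ids.add(Integer.parseInt(s.trim()));
        return ids;
    }

    // Project a row, keeping only the requested columns (null elsewhere),
    // mimicking a reader that skips the bytes of unneeded columns.
    public static String[] project(String[] row, Set<Integer> ids) {
        String[] out = new String[row.length];
        for (int i : ids) if (i < row.length) out[i] = row[i];
        return out;
    }

    public static void main(String[] args) {
        Set<Integer> ids = parseReadColumnIds("0,2");
        System.out.println(Arrays.toString(project(new String[]{"a", "b", "c"}, ids)));
    }
}
```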



[jira] [Commented] (HIVE-1643) support range scans and non-key columns in HBase filter pushdown

2012-10-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482080#comment-13482080
 ] 

Lianhui Wang commented on HIVE-1643:


Ashutosh Chauhan 
Is this correct? What about filters on OR conditions and nested filters. Do you 
plan to add support for those ?
select * from tt where col1 < 23 or (col2 < 2 and col3 = 5) or (col4 = 6 and 
(col5 = 3 or col6 = 7));

I think this needs range analysis.
In MySQL, the SQL optimizer includes range analysis over partitions and indexes, 
representing the condition ranges as a binary tree.
But there are some difficulties in task splitting: one table region may contain 
many small ranges, so it may be better to merge multiple small ranges within a 
region and use a row-key filter.
That reduces the number of visits per region.
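
The merge-small-ranges idea above can be sketched as follows; plain integer ranges stand in for row keys here, and this is an illustration of the technique, not HBase or Hive code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Merge many small, possibly overlapping [start, end] ranges that fall into one
// region, so the region is scanned once (with a row filter) instead of once per range.
public class RangeMergeSketch {
    static int[][] merge(int[][] ranges) {
        int[][] sorted = ranges.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(a[0], b[0]));
        List<int[]> out = new ArrayList<>();
        for (int[] r : sorted) {
            if (!out.isEmpty() && r[0] <= out.get(out.size() - 1)[1]) {
                int[] last = out.get(out.size() - 1);
                last[1] = Math.max(last[1], r[1]); // overlaps the previous range: extend it
            } else {
                out.add(new int[]{r[0], r[1]});    // disjoint: start a new range
            }
        }
        return out.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        int[][] merged = merge(new int[][]{{1, 3}, {2, 5}, {8, 9}});
        System.out.println(Arrays.deepToString(merged)); // [[1, 5], [8, 9]]
    }
}
```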



> support range scans and non-key columns in HBase filter pushdown
> 
>
> Key: HIVE-1643
> URL: https://issues.apache.org/jira/browse/HIVE-1643
> Project: Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.9.0
>Reporter: John Sichi
>Assignee: bharath v
>  Labels: patch
> Attachments: hbase_handler.patch, Hive-1643.2.patch, HIVE-1643.patch
>
>
> HIVE-1226 added support for WHERE rowkey=3.  We would like to support WHERE 
> rowkey BETWEEN 10 and 20, as well as predicates on non-rowkeys (plus 
> conjunctions etc).  Non-rowkey conditions can't be used to filter out entire 
> ranges, but they can be used to push the per-row filter processing as far 
> down as possible.



[jira] [Commented] (HIVE-3420) Inefficiency in hbase handler when process query including rowkey range scan

2012-10-22 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482075#comment-13482075
 ] 

Lianhui Wang commented on HIVE-3420:


@Gong Deng
Yes, I agree with you. In the InputFormat's getRecordReader():

tableSplit = convertFilter(jobConf, scan, tableSplit, iKey,
  getStorageFormatOfKey(columnsMapping.get(iKey).mappingSpec,
  jobConf.get(HBaseSerDe.HBASE_TABLE_DEFAULT_STORAGE_TYPE, "string")));

which does:

tableSplit = new TableSplit(
tableSplit.getTableName(),
startRow,
stopRow,
tableSplit.getRegionLocation(),
tableSplit.getConf());

Also, in getSplits(), each TableSplit leads to one region-local task, and right 
now the splits have no pruning effect, so the startRow/stopRow in the TableSplit 
should be kept within the region's row range for that split.

IMO, the convertFilter() logic is used in many places, for example:
HBaseStorageHandler.decomposePredicate()
HiveHBaseTableInputFormat.getSplits()
HiveHBaseTableInputFormat.getRecordReader()

I think it should live in one place: HBaseStorageHandler.decomposePredicate(), 
which can store the row-key ranges. Then 
HiveHBaseTableInputFormat.getSplits() and HiveHBaseTableInputFormat.getRecordReader()
can split the key ranges into tasks according to the table's region info.

Does anyone else have ideas? Thanks.
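
The clamping discussed here (and in the patch quoted below) amounts to intersecting the pushed-down predicate range with each split's own region boundaries. A minimal sketch of that intersection, using plain lexicographic byte comparison and the convention that an empty array means "unbounded" (illustrative, not the actual convertFilter() code):

```java
import java.util.Arrays;

// Clamp a pushed-down [startRow, stopRow) scan range to one split's region
// boundaries, so a task only reads the part of the predicate range in its region.
public class SplitClampSketch {
    // Unsigned lexicographic comparison, as HBase row keys are compared.
    static int cmp(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Returns {effectiveStart, effectiveStop} for one split.
    static byte[][] clamp(byte[] splitStart, byte[] splitStop,
                          byte[] startRow, byte[] stopRow) {
        byte[] s = (startRow.length == 0 || cmp(splitStart, startRow) >= 0)
                ? splitStart : startRow;  // take the larger lower bound
        byte[] e = (stopRow.length == 0 || cmp(splitStop, stopRow) <= 0)
                ? splitStop : stopRow;    // take the smaller upper bound
        return new byte[][]{s, e};
    }

    public static void main(String[] args) {
        // Region covers [10, 20), predicate asks for [15, 30) -> scan [15, 20).
        byte[][] r = clamp(new byte[]{10}, new byte[]{20},
                           new byte[]{15}, new byte[]{30});
        System.out.println(Arrays.toString(r[0]) + " " + Arrays.toString(r[1]));
    }
}
```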



> Inefficiency in hbase handler when process query including rowkey range scan
> 
>
> Key: HIVE-3420
> URL: https://issues.apache.org/jira/browse/HIVE-3420
> Project: Hive
>  Issue Type: Improvement
>  Components: HBase Handler
>Affects Versions: 0.9.0
> Environment: Hive-0.9.0 + HBase-0.94.1
>Reporter: Gang Deng
>Priority: Critical
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When query hive with hbase rowkey range, hive map tasks do not leverage 
> startrow, endrow information in tablesplit. For example, if the rowkeys fit 
> into 5 hbase files, then where will be 5 map tasks. Ideally, each task will 
> process 1 file. But in current implementation, each task processes 5 files 
> repeatedly. The behavior not only waste network bandwidth, but also worse the 
> lock contention in HBase block cache as each task have to access the same 
> block. The problem code is in HiveHBaseTableInputFormat.convertFilte as below:
> ……
> if (tableSplit != null) {
>   tableSplit = new TableSplit(
> tableSplit.getTableName(),
> startRow,
> stopRow,
> tableSplit.getRegionLocation());
> }
> scan.setStartRow(startRow);
> scan.setStopRow(stopRow);
> ……
> As tableSplit already include startRow, endRow information of file, the 
> better implementation will be:
> ……
> byte[] splitStart = startRow;
> byte[] splitStop = stopRow;
> if (tableSplit != null) {
> 
>if(tableSplit.getStartRow() != null){
> splitStart = startRow.length == 0 ||
>   Bytes.compareTo(tableSplit.getStartRow(), startRow) >= 0 ?
> tableSplit.getStartRow() : startRow;
> }
> if(tableSplit.getEndRow() != null){
> splitStop = (stopRow.length == 0 ||
>   Bytes.compareTo(tableSplit.getEndRow(), stopRow) <= 0) &&
>   tableSplit.getEndRow().length > 0 ?
> tableSplit.getEndRow() : stopRow;
> }   
>   tableSplit = new TableSplit(
> tableSplit.getTableName(),
> splitStart,
> splitStop,
> tableSplit.getRegionLocation());
> }
> scan.setStartRow(splitStart);
> scan.setStopRow(splitStop);
> ……
> In my test, the changed code will improve performance more than 30%.



[jira] [Commented] (HIVE-3561) Build a full SQL-compliant parser for Hive

2012-10-10 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473167#comment-13473167
 ] 

Lianhui Wang commented on HIVE-3561:


For the first approach, there is a problem: standard SQL cannot support the 
HiveQL that has been written historically, because some operators (join, for 
example) differ significantly. Migrating existing HiveQL to standard SQL could 
therefore take a lot of time.
In my opinion, both should co-exist in the short term.
 

> Build a full SQL-compliant parser for Hive
> --
>
> Key: HIVE-3561
> URL: https://issues.apache.org/jira/browse/HIVE-3561
> Project: Hive
>  Issue Type: Sub-task
>  Components: Query Processor
>Affects Versions: 0.10.0
>Reporter: Shengsheng Huang
>
> To build a full SQL compliant engine on Hive, we'll need a full SQL complant 
> parser. The current Hive parser missed a lot of grammar units from standard 
> SQL. To support full SQL there're possibly four approaches:
> 1.Extend the existing Hive parser to support full SQL constructs. We need to 
> modify the current Hive.g and add any missing grammar units and resolve 
> conflicts. 
> 2.Reuse an existing open source SQL compliant parser and extend it to support 
> Hive extensions. We may need to adapt Semantic Analyzers to the new AST 
> structure.  
> 3.Reuse an existing SQL compliant parser and make it co-exist with the 
> existing Hive parser. Both parsers share the same CliDriver interface. Use a 
> query mode configuration to switch the query mode between SQL and HQL (this 
> is the approach we're now using in the 0.9.0 demo project)
> 4.Reuse an existing SQL compliant parser and make it co-exist with the 
> existing Hive parser. Use a separate xxxCliDriver interface for standard SQL. 
>  
> Let's discuss which is the best approach. 



[jira] [Commented] (HIVE-3472) Build An Analytical SQL Engine for MapReduce

2012-09-24 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462391#comment-13462391
 ] 

Lianhui Wang commented on HIVE-3472:


NexR has done some work on translating Oracle SQL to Hive SQL.
The session slides (from page 18):
http://www.slideshare.net/cloudera/hadoop-world-2011-replacing-rdbdw-with-hadoop-and-hive-for-telco-big-data-jason-han-nexr
I think we should translate Oracle's syntax tree into Hive's syntax tree; that 
may be easy.
The alternative is to translate Oracle SQL directly into a Hive query plan, but 
I think that needs more time and work.

> Build An Analytical SQL Engine for MapReduce
> 
>
> Key: HIVE-3472
> URL: https://issues.apache.org/jira/browse/HIVE-3472
> Project: Hive
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Shengsheng Huang
> Attachments: SQL-design.pdf
>
>
> While there are continuous efforts in extending Hive’s SQL support (e.g., see 
> some recent examples such as HIVE-2005 and HIVE-2810), many widely used SQL 
> constructs are still not supported in HiveQL, such as selecting from multiple 
> tables, subquery in WHERE clauses, etc.  
> We propose to build a SQL-92 full compatible engine (for MapReduce based 
> analytical query processing) as an extension to Hive. 
> The SQL frontend will co-exist with the HiveQL frontend; consequently, one 
> can  mix SQL and HiveQL statements in their queries (switching between HiveQL 
> mode and SQL-92 mode using a “hive.ql.mode” parameter before each query 
> statement). This way useful Hive extensions are still accessible to users. 



[jira] [Commented] (HIVE-565) support for buckets in the table being inserted

2012-08-09 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431923#comment-13431923
 ] 

Lianhui Wang commented on HIVE-565:
---

I think that first we would need to store the bucket-to-file mapping information 
in the metastore database (MySQL), or alternatively we could use the file name 
to identify the bucket number.
Is there any other way?
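
The file-name option above relies on the convention that bucket files are written as 000000_0, 000001_0, and so on, so the bucket number can be recovered from the leading digits without extra metastore state. A minimal sketch of that parsing (illustrative, not the exact Hive utility):

```java
// Recover the bucket number from a bucket file name like "000002_0".
public class BucketFromFileName {
    static int bucketOf(String fileName) {
        int us = fileName.indexOf('_');
        // The digits before the first underscore are the bucket index.
        String prefix = us >= 0 ? fileName.substring(0, us) : fileName;
        return Integer.parseInt(prefix); // "000002" -> 2
    }

    public static void main(String[] args) {
        System.out.println(bucketOf("000002_0")); // 2
    }
}
```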

> support for buckets in the table being inserted
> ---
>
> Key: HIVE-565
> URL: https://issues.apache.org/jira/browse/HIVE-565
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 0.4.0
>Reporter: Namit Jain
>
> While inserting into a bucketed table, the bucketing property should be 
> maintained.
> Currently, it is lost.





[jira] [Commented] (HIVE-3306) SMBJoin/BucketMapJoin should be allowed only when join key expression is exactly matches with sort/cluster key

2012-08-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427127#comment-13427127
 ] 

Lianhui Wang commented on HIVE-3306:


@Namit, I created a new JIRA, HIVE-3329; maybe it covers some of these tasks.
I have finished the work for the case where the table is not partitioned.
Next I will work on partitioned tables.

> SMBJoin/BucketMapJoin should be allowed only when join key expression is 
> exactly matches with sort/cluster key
> --
>
> Key: HIVE-3306
> URL: https://issues.apache.org/jira/browse/HIVE-3306
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.10.0
>Reporter: Navis
>Assignee: Navis
>Priority: Minor
>
> CREATE TABLE bucket_small (key int, value string) CLUSTERED BY (key) SORTED 
> BY (key) INTO 2 BUCKETS STORED AS TEXTFILE;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE 
> bucket_small;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE 
> bucket_small;
> CREATE TABLE bucket_big (key int, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 4 BUCKETS STORED AS TEXTFILE;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE 
> bucket_big;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE 
> bucket_big;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket3outof4.txt' INTO TABLE 
> bucket_big;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket4outof4.txt' INTO TABLE 
> bucket_big;
> select count(*) FROM bucket_small a JOIN bucket_big b ON a.key + a.key = 
> b.key;
> select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON 
> a.key + a.key = b.key;
> returns 116 (same) 
> But with BucketMapJoin or SMBJoin, it returns 61. But this should not be 
> allowed cause hash(a.key) != hash(a.key + a.key). 
> Bucket context should be utilized only with exact matching join expression 
> with sort/cluster key.





[jira] [Created] (HIVE-3329) Support bucket filtering when where expression or join key expression has the bucket key

2012-08-01 Thread Lianhui Wang (JIRA)
Lianhui Wang created HIVE-3329:
--

 Summary: Support bucket filtering when where expression or join 
key expression has the bucket key 
 Key: HIVE-3329
 URL: https://issues.apache.org/jira/browse/HIVE-3329
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor
Reporter: Lianhui Wang


In HIVE-3306, a bucket context is introduced.
Example:
select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON a.key 
+ a.key = b.key
There are also some other contexts; I know of the following examples:
1. The join expression is ON (a.key = b.key and a.key=10);
2. select * from bucket_small where a.key=10;
3. The table is a partitioned table, which may be complex.
Example:
CREATE TABLE srcbucket_part (key string, value string) partitioned by (ds 
string) CLUSTERED BY (key) INTO 4 BUCKETS STORED AS RCFile;
select * from srcbucket_part where key='455' and ds='2008-04-08';
A more complex query is:
select * from srcbucket_part where (key='455' and ds='2008-04-08') or  
ds='2008-04-09';
In these contexts we should not scan all of the table's files, but only the 
relevant bucket files under the table path.
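
The pruning in these contexts comes down to: for an equality predicate on the clustering key, only the one bucket file whose index equals hash(key) mod numBuckets needs to be scanned. A sketch of that selection, with the caveat that Hive's real bucketing hash is type-specific and String.hashCode() here is only a stand-in:

```java
// Pick the single bucket file to scan for an equality predicate on the bucket key.
public class BucketPruneSketch {
    static int bucketFor(String key, int numBuckets) {
        // Mask off the sign bit so the modulus is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // For "where key='455'" on a table with 4 buckets:
        int b = bucketFor("455", 4);
        System.out.println("scan only bucket file #" + b + " of 4");
    }
}
```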








[jira] [Commented] (HIVE-3306) SMBJoin/BucketMapJoin should be allowed only when join key expression is exactly matches with sort/cluster key

2012-08-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427069#comment-13427069
 ] 

Lianhui Wang commented on HIVE-3306:


I also think there is another context.
Example: ON (a.key = b.key and a.key=10)
This should scan only the bucket file that key 10 hashes to, not all bucket 
files under the table's path.

> SMBJoin/BucketMapJoin should be allowed only when join key expression is 
> exactly matches with sort/cluster key
> --
>
> Key: HIVE-3306
> URL: https://issues.apache.org/jira/browse/HIVE-3306
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.10.0
>Reporter: Navis
>Assignee: Navis
>Priority: Minor
>
> CREATE TABLE bucket_small (key int, value string) CLUSTERED BY (key) SORTED 
> BY (key) INTO 2 BUCKETS STORED AS TEXTFILE;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE 
> bucket_small;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE 
> bucket_small;
> CREATE TABLE bucket_big (key int, value string) CLUSTERED BY (key) SORTED BY 
> (key) INTO 4 BUCKETS STORED AS TEXTFILE;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket1outof4.txt' INTO TABLE 
> bucket_big;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket2outof4.txt' INTO TABLE 
> bucket_big;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket3outof4.txt' INTO TABLE 
> bucket_big;
> load data local inpath 
> '/home/navis/apache/oss-hive/data/files/srcsortbucket4outof4.txt' INTO TABLE 
> bucket_big;
> select count(*) FROM bucket_small a JOIN bucket_big b ON a.key + a.key = 
> b.key;
> select /* + MAPJOIN(a) */ count(*) FROM bucket_small a JOIN bucket_big b ON 
> a.key + a.key = b.key;
> returns 116 (same) 
> But with BucketMapJoin or SMBJoin, it returns 61. But this should not be 
> allowed cause hash(a.key) != hash(a.key + a.key). 
> Bucket context should be utilized only with exact matching join expression 
> with sort/cluster key.





[jira] [Commented] (HIVE-3254) Reuse RunningJob

2012-07-29 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424569#comment-13424569
 ] 

Lianhui Wang commented on HIVE-3254:


Yes, I think that can be done, but newRj may be null, so you must check for null.
The JobTracker only caches a fixed number of completed jobs' info, so if the job 
you fetch has completed, the JT may already have removed its information.
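
The null-guard being suggested can be sketched generically: when re-fetching status, keep the last non-null snapshot if the tracker has already evicted the job. This is an illustration of the pattern, not the ExecDriver code:

```java
import java.util.function.Supplier;

// Refresh a cached job handle, falling back to the previous snapshot when the
// tracker returns null (e.g. a completed job evicted from the JT's cache).
public class JobPollSketch {
    static <T> T refreshOrKeep(T current, Supplier<T> fetch) {
        T latest = fetch.get();
        return latest != null ? latest : current; // guard against eviction
    }

    public static void main(String[] args) {
        String rj = "job_001@RUNNING";
        rj = refreshOrKeep(rj, () -> null); // simulate the JT evicting the job
        System.out.println(rj);             // still holds the last snapshot
    }
}
```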

> Reuse RunningJob 
> -
>
> Key: HIVE-3254
> URL: https://issues.apache.org/jira/browse/HIVE-3254
> Project: Hive
>  Issue Type: Bug
>Reporter: binlijin
>
>   private MapRedStats progress(ExecDriverTaskHandle th) throws IOException {
> while (!rj.isComplete()) {
>try {
>  Thread.sleep(pullInterval); 
>} catch (InterruptedException e) { 
>} 
>RunningJob newRj = jc.getJob(rj.getJobID());
> }
>   }
> Should we reuse the RunningJob? If not, why? 





[jira] [Commented] (HIVE-942) use bucketing for group by

2012-07-01 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404730#comment-13404730
 ] 

Lianhui Wang commented on HIVE-942:
---

I think in HIVE-931 the group-by keys must be the same as the sort keys.
But in the case where the group-by keys contain the sort keys, the aggregation 
can also be completed with a hash table on the mapper.
For example:
t is a bucketed table, sorted by c1, c2.
SQL: select t.c1, t.c2, t.c3, sum(t.c4) from t group by t.c1, t.c2, t.c3.
I think generally only the hash table on the mapper is needed, so nothing has to 
be done on the reducer.
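
The map-side strategy for this case can be sketched as follows: since the input is sorted by (c1, c2) and the group-by keys (c1, c2, c3) contain that prefix, the mapper can hold a small hash table per prefix and flush it whenever (c1, c2) changes, finishing the aggregation with bounded memory. This is a sketch of the idea, not Hive's GroupByOperator:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Streaming map-side group-by: rows arrive sorted by (c1, c2); groups keyed by
// (c1, c2, c3) are final as soon as the (c1, c2) sort prefix changes.
public class MapSideGroupBySketch {
    // Each row is {c1, c2, c3, c4}. Returns "c1|c2|c3" -> sum(c4).
    static Map<String, Long> aggregate(List<long[]> sortedRows) {
        Map<String, Long> out = new LinkedHashMap<>();
        Map<String, Long> buf = new LinkedHashMap<>(); // groups for current prefix
        String prevPrefix = null;
        for (long[] r : sortedRows) {
            String prefix = r[0] + "|" + r[1];
            if (prevPrefix != null && !prefix.equals(prevPrefix)) {
                out.putAll(buf); // sort prefix changed: these groups are complete
                buf.clear();
            }
            prevPrefix = prefix;
            buf.merge(prefix + "|" + r[2], r[3], Long::sum);
        }
        out.putAll(buf); // flush the final prefix
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> agg = aggregate(List.of(
            new long[]{1, 1, 7, 10}, new long[]{1, 1, 7, 5},
            new long[]{1, 2, 9, 3}));
        System.out.println(agg); // {1|1|7=15, 1|2|9=3}
    }
}
```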
 

> use bucketing for group by
> --
>
> Key: HIVE-942
> URL: https://issues.apache.org/jira/browse/HIVE-942
> Project: Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Namit Jain
>
> Group by on a bucketed column can be completely performed on the mapper if 
> the split can be adjusted to span the key boundary.





[jira] [Commented] (HIVE-895) Add SerDe for Avro serialized data

2011-10-07 Thread Lianhui Wang (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122896#comment-13122896
 ] 

Lianhui Wang commented on HIVE-895:
---

@Jakob: I read the Haivvreo code, and I think the same approach can be used for 
Protocol Buffers, like Haivvreo does for Avro.
Google's Tenzing paper (a Hive-like system) says it supports Protocol Buffers 
and ColumnIO.

> Add SerDe for Avro serialized data
> --
>
> Key: HIVE-895
> URL: https://issues.apache.org/jira/browse/HIVE-895
> Project: Hive
>  Issue Type: New Feature
>  Components: Serializers/Deserializers
>Reporter: Jeff Hammerbacher
>Assignee: Jakob Homan
>
> As Avro continues to mature, having a SerDe to allow HiveQL queries over Avro 
> data seems like a solid win.
