Maybe you were still using the CLI, which was pointing to the Hive 0.11 libs. After you build trunk (https://github.com/apache/hive.git), you need to use <trunk-dir>/build/dist as your Hive home and use <trunk-dir>/build/dist/bin/hive to launch the Hive CLI. You can find the Hive 0.13 libs in <trunk-dir>/build/dist/lib.
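[The environment setup described above can be sketched as follows. The clone location is hypothetical, and the clone/build commands are shown as comments rather than run; the build tooling for trunk at that time is assumed, so verify against the checked-out source.]

```shell
# Hypothetical clone location; substitute your own <trunk-dir>.
TRUNK_DIR="$HOME/hive-trunk"
# git clone https://github.com/apache/hive.git "$TRUNK_DIR"   # not run here
# (cd "$TRUNK_DIR" && ant clean package)                      # build step, not run here

# Point the environment at the freshly built distribution:
export HIVE_HOME="$TRUNK_DIR/build/dist"
export PATH="$HIVE_HOME/bin:$PATH"

# The CLI to launch, and where the 0.13 libs should now be:
echo "CLI:  $HIVE_HOME/bin/hive"
echo "libs: $HIVE_HOME/lib"
```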
btw, it seems trunk has an issue today. You can try the hive 0.12 branch.

On Thu, Sep 19, 2013 at 4:26 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:

Hello again.

I have now checked out the latest code from trunk and built it as per the instructions.

However, this query:

    select a.Symbol, count(*)
    from (select Symbol, catid from cat group by Symbol, catid) a
    group by a.Symbol;

still returns an incorrect number of rows for this table:

    create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double,
    FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int,
    ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int,
    SRCORDID string, SRCREPID int, TIMESTAMP timestamp)
    PARTITIONED BY (SYMBOL string, REPID int)
    row format delimited fields terminated by ','
    stored as ORC;

Here is the result of EXPLAIN:

    hive> EXPLAIN select a.Symbol, count(*)
        > from (select Symbol, catid from cat group by Symbol, catid) a
        > group by a.Symbol;
    OK
    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol) (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) Symbol))))

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            a:cat
              TableScan
                alias: cat
                Select Operator
                  expressions:
                        expr: symbol
                        type: string
                        expr: catid
                        type: bigint
                  outputColumnNames: symbol, catid
                  Group By Operator
                    bucketGroup: false
                    keys:
                          expr: symbol
                          type: string
                          expr: catid
                          type: bigint
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                            expr: _col1
                            type: bigint
                      sort order: ++
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                            expr: _col1
                            type: bigint
                      tag: -1
          Reduce Operator Tree:
            Group By Operator
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: string
                    expr: KEY._col1
                    type: bigint
              mode: mergepartial
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: string
                  mode: complete
                  outputColumnNames: _col0, _col1
                  Select Operator
                    expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: bigint
                    outputColumnNames: _col0, _col1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0
        Fetch Operator
          limit: -1

Using "set hive.optimize.reducededuplication=false;" I get 2 mapreduce jobs and the correct number of rows (24).

Can I verify somehow, maybe by looking in the source code, that I indeed have the correct version? Or execute a command from the hive cli that shows the version? I just built from source this morning, so it seems strange that the bug would still persist :(.
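[The single-job plan above has both _col0 and _col1 under "Map-reduce partition columns", which is the essence of the bug discussed in this thread. Here is a toy simulation of why that produces duplicate rows. It is not Hive's actual partitioner; the symbols, catids, and byte-sum hash are made up for determinism, but the mechanism matches Yin's explanation below: one job shuffles on both columns, then runs both group-bys inside each reducer.]

```python
from collections import defaultdict

# Toy rows: 3 symbols x 4 catids, standing in for the cat table.
rows = [(sym, cat) for sym in ("AAPL", "MSFT", "ORCL") for cat in range(4)]

def bucket(key, n_reducers):
    """Deterministic stand-in for a shuffle partitioner (not Hive's)."""
    return sum(str(key).encode()) % n_reducers

def run_single_job(rows, partition_key, n_reducers=2):
    """One simulated MapReduce job: shuffle rows on partition_key, then run
    both group-bys (inner on (symbol, catid), outer on symbol) inside each
    reducer, as the deduplicated single-job plan does."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[bucket(partition_key(row), n_reducers)].append(row)
    output = []
    for part in reducers.values():
        inner = set(part)             # group by symbol, catid
        counts = defaultdict(int)     # group by symbol over the inner result
        for sym, _cat in inner:
            counts[sym] += 1
        output.extend(counts.items())
    return output

# Buggy plan (HIVE-5149): partition columns are (symbol, catid), so one
# symbol's inner groups are spread over several reducers and the symbol
# is emitted once per reducer.
buggy = run_single_job(rows, partition_key=lambda r: r)
# Correct single-job plan: partition on symbol alone.
fixed = run_single_job(rows, partition_key=lambda r: r[0])

print(len(fixed))  # 3: one row per symbol, count 4 each
print(len(buggy))  # 6: every symbol duplicated across the 2 reducers
```

This also mirrors why `hive.optimize.reducededuplication=false` fixes the result: with two jobs, the second shuffle regroups on symbol alone.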
------------------------------
From: Yin Huai <huaiyin....@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
Sent: Tuesday, September 17, 2013 15:30
Subject: Re: Duplicate rows when using group by in subquery

Hello Mikael,

ReduceSinkDeduplication automatically kicked in because it is enabled by default. The original plan tries to shuffle your data twice. ReduceSinkDeduplication then finds that the original plan can be optimized to shuffle your data once. But when picking the partitioning columns, this optimizer picked the wrong columns because of the bug.

Also, you can try your query with and without ReduceSinkDeduplication (use "set hive.optimize.reducededuplication=false;" to turn this optimization off) and compare the performance. If the cardinality of "a.Symbol" limits the degree of parallelism, two MapReduce jobs may be faster.

Thanks,

Yin

On Tue, Sep 17, 2013 at 2:24 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:

Thank you for the information. Just to be clear: it is not that I have manually restricted the job to run as a single mapreduce job, but that Hive incorrectly assumes one job is enough?

I will get back with results from your suggestions ASAP; unfortunately I don't have the machines available until Thursday.

/ Sincerely Mikael

From: Yin Huai <huaiyin....@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
Sent: Monday, September 16, 2013 19:52
Subject: Re: Duplicate rows when using group by in subquery

Hello Mikael,

It seems your case is related to the bug reported in https://issues.apache.org/jira/browse/HIVE-5149. Basically, when hive uses a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are used to partition the data, and thus rows with the same value of "c.Symbol" are not correctly grouped.
In your case, only "c.Symbol" should be used if we want a single MapReduce job to evaluate this query. Can you check the query plan (the output of "EXPLAIN") and see if both "c.Symbol" and "c.catid" appear in the partitioning columns? Or you can also attach your query plan.

This bug has been fixed in trunk. Can you test your query on trunk and see if the result is correct? If you are using hive 0.11, you can also apply the 0.11 patch (https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch).

Thanks,

Yin

On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:

Hello.

This is basically the same question I posted on stackoverflow:
http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115

I know the query is a bit noisy. But this query also demonstrates the error:

    select a.symbol
    from (select symbol, ordertype from cat group by symbol, ordertype) a
    group by a.symbol;

Now, this query may not make much sense, but because I have 24 symbols I expect a result of 24 rows. Instead I get 48 rows back. A similar query:

    select a.Symbol, count(*)
    from (select c.Symbol, c.catid from cat as c group by c.Symbol, c.catid) a
    group by a.Symbol;

returns 864 rows, where I still expect 24 rows... If there are alternatives as to how to write the original query in my SO post, I would much appreciate hearing them. The examples in this mail are only meant to demonstrate the problem with easier-to-understand queries; I don't need advice on them.

The .csv data and example are from a toy example.
My real setup is 6 nodes, and the table definition is:

    create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double,
    FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int,
    ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int,
    SRCORDID string, SRCREPID int, TIMESTAMP timestamp)
    PARTITIONED BY (SYMBOL string, REPID int)
    row format delimited fields terminated by ','
    stored as ORC;

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions.pernode=1000;

    insert...

Thank you so much for any input.

/Sincerely Mikael
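[Yin's suggestion to check the partitioning columns in the EXPLAIN output can be done mechanically. A sketch, assuming the plan was saved to a file; the filename plan.txt is arbitrary, and the fragment written here is copied from the plan quoted earlier in the thread.]

```shell
# In practice: hive -e 'EXPLAIN <query>' > plan.txt
# Here we use the relevant fragment of the plan quoted in this thread.
cat > plan.txt <<'EOF'
Reduce Output Operator
  key expressions:
        expr: _col0
        type: string
        expr: _col1
        type: bigint
  sort order: ++
  Map-reduce partition columns:
        expr: _col0
        type: string
        expr: _col1
        type: bigint
  tag: -1
EOF

# Count the exprs listed under "Map-reduce partition columns".
# 2 means the single job shuffles on both (symbol, catid) -- the
# HIVE-5149 symptom.  1 would mean it shuffles on symbol alone.
grep -A4 'Map-reduce partition columns' plan.txt | grep -c 'expr:'
```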