Maybe you were still using the CLI, which was pointing to the Hive 0.11 libs. After you build trunk (https://github.com/apache/hive.git), you need to use <trunk-dir>/build/dist as your Hive home and use <trunk-dir>/build/dist/bin/hive to launch the Hive CLI. You can find the Hive 0.13 libs in <trunk-dir>/build/dist/lib.
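[The environment setup described above can be sketched as follows. The clone location is hypothetical, and the clone/build commands are shown as comments rather than run; the build tooling for trunk at that time is assumed, so verify against the checked-out source.]

```shell
# Hypothetical clone location; substitute your own <trunk-dir>.
TRUNK_DIR="$HOME/hive-trunk"
# git clone https://github.com/apache/hive.git "$TRUNK_DIR"   # not run here
# (cd "$TRUNK_DIR" && ant clean package)                      # build step, not run here

# Point the environment at the freshly built distribution:
export HIVE_HOME="$TRUNK_DIR/build/dist"
export PATH="$HIVE_HOME/bin:$PATH"

# The CLI to launch, and where the 0.13 libs should now be:
echo "CLI:  $HIVE_HOME/bin/hive"
echo "libs: $HIVE_HOME/lib"
```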
btw, it seems trunk has an issue today. You can try the hive 0.12 branch.

On Thu, Sep 19, 2013 at 4:26 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:

Hello again.

I have now checked out the latest code from trunk and built it as per the instructions.

However, this query:

    select a.Symbol, count(*)
    from (select Symbol, catid from cat group by Symbol, catid) a
    group by a.Symbol;

still returns an incorrect number of rows for this table:

    create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double,
    FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int,
    ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int,
    SRCORDID string, SRCREPID int, TIMESTAMP timestamp)
    PARTITIONED BY (SYMBOL string, REPID int)
    row format delimited fields terminated by ','
    stored as ORC;

Here is the result of EXPLAIN:

    hive> EXPLAIN select a.Symbol, count(*)
        > from (select Symbol, catid from cat group by Symbol, catid) a
        > group by a.Symbol;
    OK
    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol) (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) Symbol))))

    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 is a root stage

    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            a:cat
              TableScan
                alias: cat
                Select Operator
                  expressions:
                        expr: symbol
                        type: string
                        expr: catid
                        type: bigint
                  outputColumnNames: symbol, catid
                  Group By Operator
                    bucketGroup: false
                    keys:
                          expr: symbol
                          type: string
                          expr: catid
                          type: bigint
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                            expr: _col1
                            type: bigint
                      sort order: ++
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                            expr: _col1
                            type: bigint
                      tag: -1
          Reduce Operator Tree:
            Group By Operator
              bucketGroup: false
              keys:
                    expr: KEY._col0
                    type: string
                    expr: KEY._col1
                    type: bigint
              mode: mergepartial
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: string
                  mode: complete
                  outputColumnNames: _col0, _col1
                  Select Operator
                    expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: bigint
                    outputColumnNames: _col0, _col1
                    File Output Operator
                      compressed: false
                      GlobalTableId: 0
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

      Stage: Stage-0
        Fetch Operator
          limit: -1

Using "set hive.optimize.reducededuplication=false;" I get 2 mapreduce jobs and the correct number of rows (24).

Can I verify somehow, maybe by looking in the source code, that I indeed have the correct version? Or execute a command from the hive cli that shows the version? I just built from source this morning, so it seems strange that the bug would still persist :(.
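[The single-job plan above has both _col0 and _col1 under "Map-reduce partition columns", which is the essence of the bug discussed in this thread. Here is a toy simulation of why that produces duplicate rows. It is not Hive's actual partitioner; the symbols, catids, and byte-sum hash are made up for determinism, but the mechanism matches Yin's explanation below: one job shuffles on both columns, then runs both group-bys inside each reducer.]

```python
from collections import defaultdict

# Toy rows: 3 symbols x 4 catids, standing in for the cat table.
rows = [(sym, cat) for sym in ("AAPL", "MSFT", "ORCL") for cat in range(4)]

def bucket(key, n_reducers):
    """Deterministic stand-in for a shuffle partitioner (not Hive's)."""
    return sum(str(key).encode()) % n_reducers

def run_single_job(rows, partition_key, n_reducers=2):
    """One simulated MapReduce job: shuffle rows on partition_key, then run
    both group-bys (inner on (symbol, catid), outer on symbol) inside each
    reducer, as the deduplicated single-job plan does."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[bucket(partition_key(row), n_reducers)].append(row)
    output = []
    for part in reducers.values():
        inner = set(part)             # group by symbol, catid
        counts = defaultdict(int)     # group by symbol over the inner result
        for sym, _cat in inner:
            counts[sym] += 1
        output.extend(counts.items())
    return output

# Buggy plan (HIVE-5149): partition columns are (symbol, catid), so one
# symbol's inner groups are spread over several reducers and the symbol
# is emitted once per reducer.
buggy = run_single_job(rows, partition_key=lambda r: r)
# Correct single-job plan: partition on symbol alone.
fixed = run_single_job(rows, partition_key=lambda r: r[0])

print(len(fixed))  # 3: one row per symbol, count 4 each
print(len(buggy))  # 6: every symbol duplicated across the 2 reducers
```

This also mirrors why `hive.optimize.reducededuplication=false` fixes the result: with two jobs, the second shuffle regroups on symbol alone.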
------------------------------
From: Yin Huai <huaiyin....@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
Sent: Tuesday, September 17, 2013 15:30
Subject: Re: Duplicate rows when using group by in subquery

Hello Mikael,

ReduceSinkDeduplication automatically kicked in because it is enabled by default. The original plan tries to shuffle your data twice. ReduceSinkDeduplication then finds that the original plan can be optimized to shuffle your data once. But when picking the partitioning columns, this optimizer picked the wrong columns because of the bug.

Also, you can try your query with and without ReduceSinkDeduplication (use "set hive.optimize.reducededuplication=false;" to turn this optimization off) and compare the performance. If the cardinality of "a.Symbol" limits the degree of parallelism, two MapReduce jobs may be faster.

Thanks,

Yin

On Tue, Sep 17, 2013 at 2:24 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:

Thank you for the information. Just to be clear: it is not that I have manually restricted the job to run as a single mapreduce job, but that Hive incorrectly assumes one job is enough?

I will get back with results from your suggestions ASAP; unfortunately I don't have the machines available until Thursday.

/ Sincerely Mikael

From: Yin Huai <huaiyin....@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mikael_u...@yahoo.se>
Sent: Monday, September 16, 2013 19:52
Subject: Re: Duplicate rows when using group by in subquery

Hello Mikael,

It seems your case is related to the bug reported in https://issues.apache.org/jira/browse/HIVE-5149. Basically, when hive uses a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are used to partition the data, and thus rows with the same value of "c.Symbol" are not correctly grouped.
In your case, only "c.Symbol" should be used if we want a single MapReduce job to evaluate this query. Can you check the query plan (the output of "EXPLAIN") and see if both "c.Symbol" and "c.catid" appear in the partitioning columns? Or you can also attach your query plan.

This bug has been fixed in trunk. Can you test your query on trunk and see if the result is correct? If you are using hive 0.11, you can also apply the 0.11 patch (https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch).

Thanks,

Yin

On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <mikael_u...@yahoo.se> wrote:

Hello.

This is basically the same question I posted on stackoverflow:
http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115

I know the query is a bit noisy. But this query also demonstrates the error:

    select a.symbol
    from (select symbol, ordertype from cat group by symbol, ordertype) a
    group by a.symbol;

Now, this query may not make much sense, but because I have 24 symbols I expect a result of 24 rows. Instead I get 48 rows back. A similar query:

    select a.Symbol, count(*)
    from (select c.Symbol, c.catid from cat as c group by c.Symbol, c.catid) a
    group by a.Symbol;

returns 864 rows, where I still expect 24 rows... If there are alternatives as to how to write the original query in my SO post, I would much appreciate hearing them. The examples in this mail are only meant to demonstrate the problem with easier-to-understand queries; I don't need advice on them.

The .csv data and example are from a toy example.
My real setup is 6 nodes, and the table definition is:

    create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double,
    FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int,
    ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int,
    SRCORDID string, SRCREPID int, TIMESTAMP timestamp)
    PARTITIONED BY (SYMBOL string, REPID int)
    row format delimited fields terminated by ','
    stored as ORC;

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions.pernode=1000;

    insert...

Thank you so much for any input.

/Sincerely Mikael
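[Yin's suggestion to check the partitioning columns in the EXPLAIN output can be done mechanically. A sketch, assuming the plan was saved to a file; the filename plan.txt is arbitrary, and the fragment written here is copied from the plan quoted earlier in the thread.]

```shell
# In practice: hive -e 'EXPLAIN <query>' > plan.txt
# Here we use the relevant fragment of the plan quoted in this thread.
cat > plan.txt <<'EOF'
Reduce Output Operator
  key expressions:
        expr: _col0
        type: string
        expr: _col1
        type: bigint
  sort order: ++
  Map-reduce partition columns:
        expr: _col0
        type: string
        expr: _col1
        type: bigint
  tag: -1
EOF

# Count the exprs listed under "Map-reduce partition columns".
# 2 means the single job shuffles on both (symbol, catid) -- the
# HIVE-5149 symptom.  1 would mean it shuffles on symbol alone.
grep -A4 'Map-reduce partition columns' plan.txt | grep -c 'expr:'
```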