so that's your final assessment, eh? :) What is your comment about the outer query _joining on value_ to get the key?
On Thu, Mar 20, 2014 at 12:26 PM, Jeff Storey <storey.j...@gmail.com> wrote: > I don't think so since the inner result doesn't have the key field in it. > It ends up being > > select key from (query result that doesn't contain the key field) ... > > > On Thu, Mar 20, 2014 at 1:28 PM, Stephen Sprague <sprag...@gmail.com>wrote: > >> I agree with your assessment of the inner query. why stop there though? >> Doesn't the outer query fetch the ids of the tags that the inner query >> identified? >> >> >> >> On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey <storey.j...@gmail.com>wrote: >> >>> I don't think this quite fits here..I think the inner query will give me >>> a list of duplicate elements and their counts, but it loses the information >>> as to what id had these elements. >>> >>> I'm trying to find which pairs of ids have any duplicate tags. >>> >>> >>> On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague <sprag...@gmail.com>wrote: >>> >>>> hmm. would this not fall under the general problem of identifying >>>> duplicates? >>>> >>>> Would something like this meet your needs? (untested) >>>> >>>> select -- outer query finds the ids for the duplicates >>>> key >>>> >>>> from ( -- inner query lists duplicate values >>>> select >>>> count(*) as cnt, >>>> value >>>> from >>>> foo >>>> group by >>>> value >>>> having >>>> count(*) > 1 >>>> ) z >>>> join foo a on (a.value = z.value) >>>> ; >>>> >>>> table foo is your table elements >>>> key is your id, >>>> value is your element >>>> >>>> >>>> On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey <storey.j...@gmail.com>wrote: >>>> >>>>> I have a table with 10 million rows and 2 columns - id (int) and >>>>> element (string). I am trying to do a self join that finds any ids where >>>>> the element values are the same, and my query looks like: >>>>> >>>>> select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 >>>>> JOIN elements e2 on e1.element = e2.element WHERE e1.id < e2.id; >>>>> >>>>> I tested this at a smaller scale and it works well. The problem is >>>>> that with 10 million rows, this becomes a bit large and I've let it run >>>>> for >>>>> 90 minutes and it was up to 80GB of disk space and still going. The >>>>> original input data was only 500MB. >>>>> >>>>> Is this something I can optimize in hive? Or should I be considering a >>>>> different approach to the problem instead? >>>>> >>>>> Any guidance here would be helpful. Thank you. >>>>> >>>> >>>> >>> >> >