Re: Improving self join time

Stephen Sprague Thu, 20 Mar 2014 16:04:20 -0700

so that's your final assessment, eh? :)  What is your comment about the
outer query _joining on value_ to get the key?





On Thu, Mar 20, 2014 at 12:26 PM, Jeff Storey <storey.j...@gmail.com> wrote:

> I don't think so since the inner result doesn't have the key field in it.
> It ends up being
>
> select key from (query result that doesn't contain the key field) ...
>
>
> On Thu, Mar 20, 2014 at 1:28 PM, Stephen Sprague <sprag...@gmail.com>wrote:
>
>> I agree with your assessment of the inner query. why stop there though?
>> Doesn't the outer query fetch the ids of the tags that the inner query
>> identified?
>>
>>
>>
>> On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey <storey.j...@gmail.com>wrote:
>>
>>> I don't think this quite fits here..I think the inner query will give me
>>> a list of duplicate elements and their counts, but it loses the information
>>> as to what id had these elements.
>>>
>>> I'm trying to find which pairs of ids have any duplicate tags.
>>>
>>>
>>> On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague <sprag...@gmail.com>wrote:
>>>
>>>> hmm. would this not fall under the general problem of identifying
>>>> duplicates?
>>>>
>>>> Would something like this meet your needs? (untested)
>>>>
>>>> select  -- outer query finds the ids for the duplicates
>>>>     key
>>>>
>>>> from (  -- inner query lists duplicate values
>>>>       select
>>>>        count(*) as cnt,
>>>>        value
>>>>      from
>>>>         foo
>>>>      group by
>>>>         value
>>>>      having
>>>>        count(*) > 1
>>>>      ) z
>>>>      join foo a on (a.value = z.value)
>>>> ;
>>>>
>>>> table foo is your table elements
>>>> key is your id,
>>>> value is your element
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey <storey.j...@gmail.com>wrote:
>>>>
>>>>> I have a table with 10 million rows and 2 columns - id (int) and
>>>>> element (string). I am trying to do a self join that finds any ids where
>>>>> the element values are the same, and my query looks like:
>>>>>
>>>>> select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1
>>>>> JOIN elements e2 on e1.element = e2.element WHERE e1.id < e2.id;
>>>>>
>>>>> I tested this at a smaller scale and it works well. The problem is
>>>>> that with 10 million rows, this becomes a bit large and I've let it run 
>>>>> for
>>>>> 90 minutes and it was up to 80GB of disk space and still going. The
>>>>> original input data was only 500MB.
>>>>>
>>>>> Is this something I can optimize in hive? Or should I be considering a
>>>>> different approach to the problem instead?
>>>>>
>>>>> Any guidance here would be helpful. Thank you.
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Improving self join time

Reply via email to