Re: Load Hive query result with array field into pig

2014-03-21 Thread Jeff Storey
Sorry for another post on this thread. I had an error in my pig script that
split on the wrong unicode character. Using STRSPLIT worked well.
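For anyone who hits the same thing, here is a minimal sketch of the approach that worked. The path, field names, and exact escape syntax are assumptions (they may need adjusting per Pig version); Hive's defaults are \u0001 between fields and \u0002 between array elements:

```
-- Hypothetical sketch: load the ^A-delimited Hive output, then split the
-- ^B-delimited array field into a tuple with STRSPLIT.
raw = LOAD '/path/to/output' USING PigStorage('\\u0001')
      AS (elements:chararray, afield:chararray);
rows = FOREACH raw GENERATE STRSPLIT(elements, '\\u0002') AS elements, afield;
```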


On Fri, Mar 21, 2014 at 8:46 AM, Jeff Storey wrote:

> Correction - it looks like the query uses \u0002 to separate array elements
> and \u0001 to separate the other fields. The question is still the same,
> though: how can I load that array into pig?
>
> Note - if my data is formatted as a tsv, with parentheses surrounding the
> array and a tab before the next field:
>
> (element1,element2,element3)<TAB>anotherfield
>
> This loads properly when I use LOAD '/my/tsvfile' USING PigStorage('\t')
> AS (elements:tuple(),afield:chararray);
>
>
> On Fri, Mar 21, 2014 at 8:38 AM, Jeff Storey wrote:
>
>> I'm executing a hive query in which one of the fields is an array and
>> writing it to a file using:
>>
>> INSERT OVERWRITE DIRECTORY '/path/to/output' SELECT ...
>>
>> This query works well. I would like to load this data into pig, but I'm
>> not quite sure how to get the array properly into pig.
>>
>> My output file from the query doesn't look like it has any array notation
>> when using the default ^A separator.
>>
>> Is there a way I can execute a query with an array field output that can
>> then be directly loaded into pig as a tuple?
>>
>> Thank you.
>>
>
>
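One way to sidestep the delimiter handling entirely (an untested sketch; the table and column names here are made up) is to have Hive render the array as a parenthesized string, producing exactly the (a,b,c) form that the tuple-based LOAD in the quoted message already handles:

```sql
-- CONCAT_WS joins an array<string> with a separator; wrapping the result in
-- parentheses gives Pig's tuple notation. The directory output is still
-- ^A-separated, so the Pig side would use PigStorage('\u0001'), not '\t'.
INSERT OVERWRITE DIRECTORY '/path/to/output'
SELECT CONCAT('(', CONCAT_WS(',', elements), ')'), afield
FROM my_table;
```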


Re: Improving self join time

2014-03-20 Thread Jeff Storey
I don't think so since the inner result doesn't have the key field in it.
It ends up being

select key from (query result that doesn't contain the key field) ...
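For what it's worth, here is an untested variant of the suggested query that does keep the key: join the duplicate-value list back to foo twice and pair the keys directly (same foo/key/value naming as in the quoted reply below):

```sql
-- dup holds each value that occurs more than once; joining foo to it twice
-- pairs the keys sharing a value, and a.key < b.key keeps each pair once.
SELECT a.key, b.key AS key2, a.value
FROM (SELECT value FROM foo GROUP BY value HAVING count(*) > 1) dup
JOIN foo a ON (a.value = dup.value)
JOIN foo b ON (b.value = dup.value)
WHERE a.key < b.key;
```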


On Thu, Mar 20, 2014 at 1:28 PM, Stephen Sprague wrote:

> I agree with your assessment of the inner query. Why stop there though?
> Doesn't the outer query fetch the ids of the tags that the inner query
> identified?
>
>
>
> On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey wrote:
>
>> I don't think this quite fits here... I think the inner query will give me
>> a list of duplicate elements and their counts, but it loses the information
>> as to which id had these elements.
>>
>> I'm trying to find which pairs of ids have any duplicate tags.
>>
>>
>> On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague wrote:
>>
>>> hmm. would this not fall under the general problem of identifying
>>> duplicates?
>>>
>>> Would something like this meet your needs? (untested)
>>>
>>> select  -- outer query finds the ids for the duplicates
>>>   key
>>> from (  -- inner query lists duplicate values
>>>   select
>>>     count(*) as cnt,
>>>     value
>>>   from
>>>     foo
>>>   group by
>>>     value
>>>   having
>>>     count(*) > 1
>>> ) z
>>> join foo a on (a.value = z.value)
>>> ;
>>>
>>> table foo is your table elements
>>> key is your id,
>>> value is your element
>>>
>>>
>>> On Thu, Mar 20, 2014 at 7:03 AM, Jeff Storey wrote:
>>>
>>>> I have a table with 10 million rows and 2 columns - id (int) and
>>>> element (string). I am trying to do a self join that finds any ids where
>>>> the element values are the same, and my query looks like:
>>>>
>>>> select e1.id, e1.element, e2.id as id2, e2.element as element2 from
>>>> elements e1 JOIN elements e2 on e1.element = e2.element WHERE e1.id < e2.id;
>>>>
>>>> I tested this at a smaller scale and it works well. The problem is that
>>>> with 10 million rows, this becomes a bit large and I've let it run for 90
>>>> minutes and it was up to 80GB of disk space and still going. The original
>>>> input data was only 500MB.
>>>>
>>>> Is this something I can optimize in hive? Or should I be considering a
>>>> different approach to the problem instead?
>>>>
>>>> Any guidance here would be helpful. Thank you.
>>>>
>>>
>>>
>>
>
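Another angle on the size blow-up (an untested sketch using the thread's table and column names): if one element value is shared by k rows, the self join emits k*(k-1)/2 output rows, so hot values dominate the 80GB. Aggregating ids per element first materializes each repeated element once, and the pairs can be expanded downstream (e.g. in Pig or a UDF):

```sql
-- collect_set gathers the ids that share an element into a single array;
-- the HAVING clause drops elements that cannot form any pair.
SELECT element, collect_set(id) AS ids
FROM elements
GROUP BY element
HAVING count(*) > 1;
```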

