TRANSFORM should allow pipes in some form
-----------------------------------------

                 Key: HIVE-1251
                 URL: https://issues.apache.org/jira/browse/HIVE-1251
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Adam Kramer


Many traditional transforms can be accomplished via simple unix commands 
chained together. For example, the "sort" phase is an instance of "cut -f 1 | 
sort". However, the TRANSFORM command in Hive doesn't allow for unix-style 
piping to occur.

One classic case where I wish there were piping is when I want to "stack" a 
column into several rows:

SELECT TRANSFORM(key, col0, col1, col2)
    USING 'python stacker.py | python reducer.py' AS key, value FROM table

...in this case, stacker.py would produce output of this form:
key col0
key col1
key col2
...and then the reducer would reduce the above down to one item per key. In 
this case, the current workaround is this:

SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
    (SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, col
     FROM table) a

...the problem here is that as a user, *I should not be allowed to assume* that 
the output from the inner query will be passed DIRECTLY to the outer query 
(i.e., the outer query should not assume that it gets the inner query's output 
on the same box and in the same order). I know as a programmer that this works 
fine as a pipe, but when writing Hive code I always wonder--what if Hive 
decides to run the inner query in a reduce step, and the outer query in a 
subsequent map step?
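
To make the ordering concern concrete, here is a minimal Python sketch of what 
the two scripts might look like. The function names and the choice of 
reduction (joining a key's values with commas) are assumptions made for 
illustration; the real stacker.py and reducer.py are not shown in this issue.

```python
import itertools


def stack(lines):
    """Hypothetical stacker.py logic: one 'key<TAB>value' row per input column."""
    for line in lines:
        key, *cols = line.rstrip("\n").split("\t")
        for col in cols:
            yield f"{key}\t{col}"


def reduce_rows(lines):
    """Hypothetical reducer.py logic: collapse CONSECUTIVE rows that share a key.

    Assumption: the reduction joins the values with commas. groupby only merges
    adjacent rows, so the result depends on rows arriving in stacker order.
    """
    rows = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        yield f"{key}\t{','.join(value for _, value in group)}"
```

Chained directly, as a pipe would chain them, reduce_rows(stack(...)) yields 
one row per key. If the same stacked rows reach the reducer interleaved with 
rows for other keys, a key comes out more than once, which is exactly the 
failure the nested-query workaround cannot rule out.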

Broadly, my understanding is that the goal of Hive is to abstract the mapreduce 
process away from users. To this end, we have syntax (CLUSTER BY) that allows 
users to assume that a reduce task will occur (but see also 
https://issues.apache.org/jira/browse/HIVE-835 ), but there is no formal way to 
force or syntactically assume that the data will NOT be copied or sorted or 
transformed. I argue that the only case where this would be necessary or 
desirable would be in the instance of a pipe within a transform...ergo a desire 
for | to work as expected.
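
Until | is supported, one possible stopgap is a single wrapper script that 
chains the stages itself, so that TRANSFORM only ever sees one command. The 
sketch below is an assumption, not something Hive provides: run_pipeline and 
STAGES are hypothetical names, and unlike a real pipe it buffers each stage's 
full output in memory instead of streaming it.

```python
import subprocess
import sys

# Hypothetical stage list: the two scripts from the example above,
# launched with the same interpreter that runs this wrapper.
STAGES = [
    [sys.executable, "stacker.py"],
    [sys.executable, "reducer.py"],
]


def run_pipeline(commands, data):
    """Run each command in order, feeding the previous stage's stdout (bytes)
    to the next stage's stdin, and return the final stage's stdout."""
    for argv in commands:
        completed = subprocess.run(
            argv, input=data, stdout=subprocess.PIPE, check=True
        )
        data = completed.stdout
    return data
```

A wrapper script would read its standard input, call 
run_pipeline(STAGES, ...) on it, and write the result to standard output. 
Because both stages then run inside one process tree on one machine, the 
ordering between them no longer depends on how Hive plans the query.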

An alternative would be for the HQL definition to explicitly state all 
conditions that would cause a task boundary to be crossed (so I can make the 
strong assumption that if none of those conditions obtains, my query will be 
supported in the future)...but that seems potentially restrictive as the 
language and Hadoop evolve.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
