[jira] [Resolved] (HIVE-836) Add syntax to force a new mapreduce job / transform subquery in mapper

2015-01-08 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer resolved HIVE-836.
--
  Resolution: Won't Fix
Release Note: See comments for workarounds.

> Add syntax to force a new mapreduce job / transform subquery in mapper
> --
>
> Key: HIVE-836
> URL: https://issues.apache.org/jira/browse/HIVE-836
> Project: Hive
>  Issue Type: Wish
>Reporter: Adam Kramer
>
> Hive currently does a lot of awesome work to figure out when my transformers 
> should be used in the mapper and when they should be used in the reducer. 
> However, sometimes I have a different plan.
> For example, consider this:
> {code:title=foo.sql}
> SELECT TRANSFORM(a.val1, a.val2)
> USING './niftyscript'
> AS part1, part2, part3
> FROM (
> SELECT b.val AS val1, c.val AS val2
> FROM tblb b JOIN tblc c on (b.key=c.key)
> ) a
> {code}
> ...now, assume that the join step is very easy and 'niftyscript' is really 
> processor intensive. The ideal format for this is a MR task with few mappers 
> and few reducers, and then a second MR task with lots of mappers.
> Currently, there is no way to even require that the outer TRANSFORM statement 
> occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin 
> to /* +MAPJOIN(x) */, would be awesome.
> Current workaround is to dump everything to a temporary table and then start 
> over, but that is not easy to scale--the subquery structure effectively 
> (and easily) "locks" the mid-points so no other job can touch the table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-836) Add syntax to force a new mapreduce job / transform subquery in mapper

2015-01-08 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270390#comment-14270390
 ] 

Adam Kramer commented on HIVE-836:
--

Oh hey there five year old task.

Workaround: Use CLUSTER BY to force a reduce phase, and a staging table to 
force a map phase. Hive writes all the data to disk in every phase anyway so 
the staging table isn't actually a performance hit.

Also protip: DON'T get distracted by the Hive keywords "MAP" and "REDUCE", they 
are just synonyms for TRANSFORM and do not do what anybody expects.



[jira] [Commented] (HIVE-3491) Expose column names to UDFs

2012-09-20 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459968#comment-13459968
 ] 

Adam Kramer commented on HIVE-3491:
---

These aren't "functions," they're classes and methods -- and it is entirely 
reasonable from a programming standpoint to have classes be created or 
instantiated with some amount of context.

We already have certain functions fail in certain ways because "strict mode" is 
not set to on, for example -- this task is to let "User-Defined" functions do 
this, too.

Also, from an underpinnings standpoint, it is entirely reasonable for a job 
running on a mapreduce node to pass random set variables from the client to the 
nodes.

And once again, there ARE cases in which the user cannot easily pass the column 
names in, as mentioned, when using an asterisk to sweep all of the columns.

> Expose column names to UDFs
> ---
>
> Key: HIVE-3491
> URL: https://issues.apache.org/jira/browse/HIVE-3491
> Project: Hive
>  Issue Type: New Feature
>  Components: Query Processor, UDF
>Reporter: Adam Kramer
>
> If I run
> SELECT MY_FUNC(a.foo, b.bar) FROM baz1 a JOIN baz2 b;
> ...the parsed query structure (i.e., that "foo" and "bar" are the names of the 
> columns) should be available to the UDF in some manner.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-3491) Expose column names to UDFs

2012-09-20 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459870#comment-13459870
 ] 

Adam Kramer commented on HIVE-3491:
---

Edward, are you saying that that makes this UNLIKELY to be completed? Or that 
the three things should be merged together to communally get over the hump?



[jira] [Commented] (HIVE-3491) Expose column names to UDFs

2012-09-20 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459436#comment-13459436
 ] 

Adam Kramer commented on HIVE-3491:
---

Yes, the goal is to have this be eventually exposed.

Having the column names generally available will also help a lot with some of 
our worse error messages. :) So even if it is not trivial it is worth taking 
seriously.



[jira] [Commented] (HIVE-3491) Expose column names to UDFs

2012-09-19 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459328#comment-13459328
 ] 

Adam Kramer commented on HIVE-3491:
---

The goal is to not do that or not have to do that. Also this does not work with 
SELECT MY_FUNC(*) which is the eventual use case.



[jira] [Created] (HIVE-3491) Expose column names to UDFs

2012-09-19 Thread Adam Kramer (JIRA)
Adam Kramer created HIVE-3491:
-

 Summary: Expose column names to UDFs
 Key: HIVE-3491
 URL: https://issues.apache.org/jira/browse/HIVE-3491
 Project: Hive
  Issue Type: New Feature
  Components: Query Processor, UDF
Reporter: Adam Kramer


If I run

SELECT MY_FUNC(a.foo, b.bar) FROM baz1 a JOIN baz2 b;

...the parsed query structure (i.e., that "foo" and "bar" are the names of the 
columns) should be available to the UDF in some manner.



[jira] [Created] (HIVE-3490) Implement * or a.* for arguments to UDFs

2012-09-19 Thread Adam Kramer (JIRA)
Adam Kramer created HIVE-3490:
-

 Summary: Implement * or a.* for arguments to UDFs
 Key: HIVE-3490
 URL: https://issues.apache.org/jira/browse/HIVE-3490
 Project: Hive
  Issue Type: Bug
  Components: Query Processor, UDF
Reporter: Adam Kramer


For a random UDF, we should be able to use * or a.* to refer to "all of the 
columns in their natural order." This is not currently implemented.

I'm reporting this as a bug because it is a manner in which Hive is 
inconsistent with the SQL spec, and because Hive claims to implement *.

hive> select all_non_null(a.*) from table a where a.ds='2012-09-01';
FAILED: ParseException line 1:25 mismatched input '*' expecting Identifier near 
'.' in expression specification




[jira] [Created] (HIVE-2971) GET_JSON_OBJECT fails on some valid JSON keys

2012-04-23 Thread Adam Kramer (JIRA)
Adam Kramer created HIVE-2971:
-

 Summary: GET_JSON_OBJECT fails on some valid JSON keys
 Key: HIVE-2971
 URL: https://issues.apache.org/jira/browse/HIVE-2971
 Project: Hive
  Issue Type: Bug
  Components: UDF
Reporter: Adam Kramer
Priority: Minor


hive> SELECT GET_JSON_OBJECT("{\"Form Name\": 12345}", "$.Form\ Name") FROM 
akramer_one_row;
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
NULL

...this also returns null for "$.Form Name" and "$.Form\\ Name". It should 
return the relevant key.

Removing the space works fine; however, spaces are allowed in JSON keys (see 
spec at http://www.json.org/ ). As such, this is a bug.

Claiming that this is org.json's problem, or something similar, does not solve 
this bug. It's Hive that claims this gets a JSON object, so it needs to provide 
the JSON object.
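
To make the expected behaviour concrete, here is a minimal Python sketch of a path lookup that tolerates spaces in keys. It is not Hive's implementation; the function name get_json_value is hypothetical, and it only handles plain "$.key.key" paths (no array indexing).

```python
import json


def get_json_value(json_text, path):
    """Resolve a '$.key1.key2' style path; keys may contain spaces."""
    if not path.startswith("$"):
        return None
    node = json.loads(json_text)
    # Split on '.' only, so a key such as 'Form Name' stays in one piece.
    for key in [part for part in path[1:].split(".") if part]:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


print(get_json_value('{"Form Name": 12345}', "$.Form Name"))
```

Under these assumptions the lookup for "$.Form Name" finds the key, which is the result the report asks GET_JSON_OBJECT to produce.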





[jira] [Commented] (HIVE-2678) Programmatically limit CLI status updates

2012-04-23 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259820#comment-13259820
 ] 

Adam Kramer commented on HIVE-2678:
---

No, that is not enough. For some queries ms makes sense, but for others % done 
makes sense. If a query advances by more than 1% done within the interval, the 
lack of output is maddening.

> Programmatically limit CLI status updates
> -
>
> Key: HIVE-2678
> URL: https://issues.apache.org/jira/browse/HIVE-2678
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI, Logging
>Reporter: Adam Kramer
>
> Provide a way to configure the frequency of Hive logging output, i.e., these:
> 2011-12-23 22:31:20,979 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 
> 567.27 sec
> Some jobs update more than once per second, which is way more than necessary 
> (and runs users out of scrollback buffer when using the CLI in screen).
> Default should be to update when map % or reduce % complete has gone up by 
> one, and should be configurable via "SET mapred.update.rate=N;" to indicate 
> that I would like updates every N seconds.
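
The two policies in the request (print when a percentage changes, or print at most every N seconds) can be sketched as a small throttle. This is an illustrative Python model only; the class ProgressThrottle and its parameter min_seconds are hypothetical and do not correspond to any existing Hive setting.

```python
class ProgressThrottle:
    """Decide whether a progress line should be printed."""

    def __init__(self, min_seconds=None):
        # None means "percent-based": print only when map % or reduce % changes.
        self.min_seconds = min_seconds
        self.last_time = None
        self.last_progress = None

    def should_print(self, now, map_pct, reduce_pct):
        progress = (int(map_pct), int(reduce_pct))
        if self.last_progress is None:
            emit = True  # always print the first update
        elif self.min_seconds is not None:
            emit = now - self.last_time >= self.min_seconds
        else:
            emit = progress != self.last_progress
        if emit:
            self.last_time, self.last_progress = now, progress
        return emit


throttle = ProgressThrottle()
for now, map_pct in [(0.0, 16), (0.2, 16), (0.4, 17)]:
    if throttle.should_print(now, map_pct, 0):
        print(f"Stage-1 map = {map_pct}%,  reduce = 0%")
```

With the percent-based default, the repeated 16% update is suppressed and only the 16% and 17% lines are printed.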





[jira] [Created] (HIVE-2409) Semicolons in strings/comments are parsed as query-ending.

2011-08-25 Thread Adam Kramer (JIRA)
Semicolons in strings/comments are parsed as query-ending.
--

 Key: HIVE-2409
 URL: https://issues.apache.org/jira/browse/HIVE-2409
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Adam Kramer


This fails:
select '.*? (f_.*?)[ ;$]' from akramer_one_row ;

This succeeds:
select '.*? (f_.*?)[ \;$]' from akramer_one_row ;

...there is no reasonable syntactic structure that would require the escaping 
of a semicolon in a '-marked string. The query parser should NOT split on 
semicolons that are in strings OR in comments. 
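
A minimal sketch of the requested splitting rule, written in Python for illustration. It is not Hive's parser: the function name split_statements is hypothetical, and it assumes only single-quoted strings, double-quoted strings, backslash escapes inside strings, and "--" line comments.

```python
def split_statements(script):
    """Split on ';' only when outside quoted strings and '--' comments."""
    statements, current = [], []
    quote = None        # the quote character of the open string, if any
    in_comment = False  # True from '--' until the end of the line
    i = 0
    while i < len(script):
        ch = script[i]
        if in_comment:
            current.append(ch)
            if ch == "\n":
                in_comment = False
        elif quote is not None:
            current.append(ch)
            if ch == "\\" and i + 1 < len(script):
                # Keep the escaped character verbatim and skip over it.
                current.append(script[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif script.startswith("--", i):
            in_comment = True
            current.append(ch)
        elif ch == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


print(split_statements("select '.*? (f_.*?)[ ;$]' from akramer_one_row ;"))
```

With this rule the failing query above is treated as a single statement, because the semicolon inside the quoted pattern never terminates it.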





[jira] [Created] (HIVE-2363) Implicitly CLUSTER BY when dynamically partitioning

2011-08-09 Thread Adam Kramer (JIRA)
Implicitly CLUSTER BY when dynamically partitioning
---

 Key: HIVE-2363
 URL: https://issues.apache.org/jira/browse/HIVE-2363
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Adam Kramer
Priority: Critical


Whenever someone is dynamically creating partitions, the underlying 
implementation is to look at the output data, write it to a file so long as the 
partition columns are contiguous, then to close that file and open a new one if 
the partition column changes. This leads to potentially way too many files 
generated.

The solution is to ensure that all of the data for a given partition value 
appears contiguously and on the same reducer, i.e., to cluster by the 
partitioning columns on the way out.

This improvement is to detect whether a query is clustering by the eventual 
partition columns, and if not, to do so as an additional step at the end of the 
query. This will potentially save lots of space.
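
The effect can be illustrated with a toy Python model. It assumes, as described above, a single writer that closes the current file and opens a new one whenever the partition value changes; the function name count_output_files is hypothetical and nothing here is Hive code.

```python
from itertools import groupby


def count_output_files(partition_values):
    """One file per contiguous run of identical partition values."""
    return sum(1 for _ in groupby(partition_values))


rows = ["2011-08-01", "2011-08-02", "2011-08-01", "2011-08-02"]
print(count_output_files(rows))          # interleaved partition values
print(count_output_files(sorted(rows)))  # clustered by the partition column
```

In this model the interleaved input produces one file per row, while the clustered input produces one file per distinct partition value.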





[jira] [Updated] (HIVE-2333) LazySimpleSerDe does not properly handle arrays / escape control characters

2011-08-02 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-2333:
--

Priority: Critical  (was: Major)

> LazySimpleSerDe does not properly handle arrays / escape control characters
> ---
>
> Key: HIVE-2333
> URL: https://issues.apache.org/jira/browse/HIVE-2333
> Project: Hive
>  Issue Type: Bug
>Reporter: Jonathan Chang
>Priority: Critical
>
> LazySimpleSerDe, the default SerDe for Hive is severely broken:
> * Empty arrays are serialized as an empty string. Hence array(array()) is 
> indistinguishable from array(array(array())) and from array().
> * Similarly, empty strings are serialized as an empty string. Hence array('') 
> is also indistinguishable from an empty array.
> * if the serialized string equals the null sequence, then it is ambiguous as 
> to whether it is an array with a single null element or a null array.
> It also does not do well with control characters:
> > select array('foo\002bar') from tmp;
> ...
> ["foo","bar"]
> > select array('foo\001bar') from tmp;
> ...
> ["foo"]
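
The ambiguities listed above can be reproduced with a toy delimiter-based serializer. This Python sketch is a simplified model, not LazySimpleSerDe itself; the per-level separator characters in SEPARATORS are an assumption chosen for illustration.

```python
# Assumed delimiters, one per nesting level (illustrative only).
SEPARATORS = ["\x02", "\x03", "\x04"]


def serialize(value, depth=0):
    """Join list elements with the delimiter for this nesting level."""
    if isinstance(value, list):
        return SEPARATORS[depth].join(serialize(v, depth + 1) for v in value)
    return str(value)


# All of these collapse to the same empty string.
print([serialize(v) for v in ([], [""], [[]], [[[]]])])

# A delimiter inside a string looks exactly like an element boundary.
print(serialize(["foo\x02bar"]) == serialize(["foo", "bar"]))
```

Because nothing is escaped and nothing records the element count, a reader of the serialized form cannot tell these values apart.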





[jira] [Created] (HIVE-2330) Allow tables to warn on querying

2011-08-01 Thread Adam Kramer (JIRA)
Allow tables to warn on querying


 Key: HIVE-2330
 URL: https://issues.apache.org/jira/browse/HIVE-2330
 Project: Hive
  Issue Type: Wish
Reporter: Adam Kramer


It would be excellent if I could set a TBLPROPERTY like "motd" or "motq" that 
means "message of the query." This message would then actually print to the CLI 
or stderr whenever the table was queried.

Use cases: Warning people that they are querying a deprecated table, or warning 
people that the table they are querying has sensitive information, or that it 
is probably already aggregated somewhere.





[jira] [Updated] (HIVE-2327) UDFs should be made aware when their arguments are constants.

2011-07-31 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-2327:
--

Summary: UDFs should be made aware when their arguments are constants.  
(was: UDFs should be made aware of their arguments are constants.)

> UDFs should be made aware when their arguments are constants.
> -
>
> Key: HIVE-2327
> URL: https://issues.apache.org/jira/browse/HIVE-2327
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> There are a lot of UDFs which would show major performance differences if they 
> could assume that some of their arguments are constant.
> Consider, for example, any UDF that takes a regular expression as input: This 
> can be compiled once (fast) if it's a constant, or once per row (wicked slow) 
> if it's not a constant.
> Or, consider any UDF that reads from a file and/or takes a filename as input; 
> it would have to re-read the whole file if the filename changes.





[jira] [Created] (HIVE-2327) UDFs should be made aware of their arguments are constants.

2011-07-31 Thread Adam Kramer (JIRA)
UDFs should be made aware of their arguments are constants.
---

 Key: HIVE-2327
 URL: https://issues.apache.org/jira/browse/HIVE-2327
 Project: Hive
  Issue Type: Improvement
Reporter: Adam Kramer


There are a lot of UDFs which would show major performance differences if they 
could assume that some of their arguments are constant.

Consider, for example, any UDF that takes a regular expression as input: This 
can be compiled once (fast) if it's a constant, or once per row (wicked slow) 
if it's not a constant.

Or, consider any UDF that reads from a file and/or takes a filename as input; 
it would have to re-read the whole file if the filename changes.
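
The regular-expression case can be sketched in Python as follows. The class RegexExtractUdf and its constant_pattern parameter are hypothetical stand-ins for "the engine tells the UDF that this argument is constant"; this is not a Hive API.

```python
import re


class RegexExtractUdf:
    """Toy UDF that compiles once when told its pattern is constant."""

    def __init__(self, constant_pattern=None):
        self.compile_count = 0
        self._compiled = None
        if constant_pattern is not None:
            # Constant argument: pay the compilation cost a single time.
            self._compiled = self._compile(constant_pattern)

    def _compile(self, pattern):
        self.compile_count += 1
        return re.compile(pattern)

    def evaluate(self, value, pattern=None):
        regex = self._compiled
        if regex is None:
            # Non-constant argument: the pattern may differ on every row.
            regex = self._compile(pattern)
        match = regex.search(value)
        return match.group(0) if match else None


rows = ["f_one", "x", "f_two"]
constant_udf = RegexExtractUdf(constant_pattern=r"f_\w+")
results = [constant_udf.evaluate(row) for row in rows]
print(results, constant_udf.compile_count)
```

When the pattern is known to be constant, the compile step runs once regardless of the number of rows; without that knowledge it has to run once per row.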





[jira] [Created] (HIVE-2317) Create a tool to MAP a function across an array.

2011-07-27 Thread Adam Kramer (JIRA)
Create a tool to MAP a function across an array.


 Key: HIVE-2317
 URL: https://issues.apache.org/jira/browse/HIVE-2317
 Project: Hive
  Issue Type: New Feature
Reporter: Adam Kramer


Request: A function, say FUNCTION_MAP, that will map a function (udf or native) 
across an array, returning the result. Desired syntax:

{code}
bar:
arr  foo
[1,2,3]  3
[4,5,2]  2
[7,7,6]  1

SELECT FUNCTION_MAP(arr, 'LOG10') FROM bar
{code}
...should then return
{code}
[0, 0.301, 0.477]
[0.602, 0.699, 0.301]
[0.845, 0.845, 0.778]
{code}

...ideally, FUNCTION_MAP would take additional arguments which would be passed 
to the function.
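
The requested semantics, including the pass-through of additional arguments, can be sketched in Python. The name function_map mirrors the proposed FUNCTION_MAP but is only an illustration; the sample values in the request for [1,2,3] correspond to base-10 logarithms, so math.log10 is used here.

```python
import math


def function_map(arr, func, *extra_args):
    """Apply func to each element, forwarding any additional arguments."""
    return [func(element, *extra_args) for element in arr]


# One output array per input row, same length as the input array.
print([round(x, 3) for x in function_map([1, 2, 3], math.log10)])

# Additional arguments are passed through to the mapped function.
print(function_map([1, 2, 3], pow, 2))
```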





[jira] [Updated] (HIVE-178) SELECT without FROM should assume a one-row table with no columns.

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-178:
-

Component/s: Testing Infrastructure
Description: 
SELECT 1+1;

should just return '2', but instead hive fails because no table is listed.

SELECT 1+1 FROM (empty table);

should also just return '2', but instead hive "succeeds" because there is "no 
possible output," so it produces no output.

So, currently we have to run 

SELECT 1+1 FROM (silly one-row dummy table);

...which runs a whole mapreduce step to ignore a column of data that is useless 
anyway. This is much easier due to local mode, but still, it would be nice to 
be able to SELECT without specifying a table and to get one row of output in 
moments instead of waiting for even a local-mode job to launch, complete, and 
return.

This is especially useful for testing UDFs.

Relatedly, an optimization by which Hive can tell that data from a table isn't 
even USED would be useful, because it means that the data needn't be 
queried...the only relevant info from the table would be the number of rows it 
has, which is available for free from the metastore.

  was:
SELECT 1+1;

should just return '2', but instead hive fails because no table is listed.

SELECT 1+1 FROM (empty table);

should also just return '2', but instead hive "succeeds" because there is "no 
possible output," so it produces no output.

So, currently we have to run 

SELECT 1+1 FROM (silly one-row dummy table);

...which runs a whole mapreduce step to ignore a column of data that is useless 
anyway. This is much easier due to local mode, but still, it would be nice to 
be able to SELECT without specifying a table and to get one row of output in 
moments instead of waiting for even a local-mode job to launch, complete, and 
return.

Relatedly, an optimization by which Hive can tell that data from a table isn't 
even USED would be useful, because it means that the data needn't be 
queried...the only relevant info from the table would be the number of rows it 
has, which is available for free from the metastore.


> SELECT without FROM should assume a one-row table with no columns.
> --
>
> Key: HIVE-178
> URL: https://issues.apache.org/jira/browse/HIVE-178
> Project: Hive
>  Issue Type: Wish
>  Components: Query Processor, Testing Infrastructure
>Reporter: Adam Kramer
>Priority: Minor
>  Labels: SQL
>
> SELECT 1+1;
> should just return '2', but instead hive fails because no table is listed.
> SELECT 1+1 FROM (empty table);
> should also just return '2', but instead hive "succeeds" because there is "no 
> possible output," so it produces no output.
> So, currently we have to run 
> SELECT 1+1 FROM (silly one-row dummy table);
> ...which runs a whole mapreduce step to ignore a column of data that is 
> useless anyway. This is much easier due to local mode, but still, it would be 
> nice to be able to SELECT without specifying a table and to get one row of 
> output in moments instead of waiting for even a local-mode job to launch, 
> complete, and return.
> This is especially useful for testing UDFs.
> Relatedly, an optimization by which Hive can tell that data from a table 
> isn't even USED would be useful, because it means that the data needn't be 
> queried...the only relevant info from the table would be the number of rows 
> it has, which is available for free from the metastore.





[jira] [Updated] (HIVE-178) SELECT without FROM should assume a one-row table with no columns.

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-178:
-

Description: 
SELECT 1+1;

should just return '2', but instead hive fails because no table is listed.

SELECT 1+1 FROM (empty table);

should also just return '2', but instead hive "succeeds" because there is "no 
possible output," so it produces no output.

So, currently we have to run 

SELECT 1+1 FROM (silly one-row dummy table);

...which runs a whole mapreduce step to ignore a column of data that is useless 
anyway. This is much easier due to local mode, but still, it would be nice to 
be able to SELECT without specifying a table and to get one row of output in 
moments instead of waiting for even a local-mode job to launch, complete, and 
return.

Relatedly, an optimization by which Hive can tell that data from a table isn't 
even USED would be useful, because it means that the data needn't be 
queried...the only relevant info from the table would be the number of rows it 
has, which is available for free from the metastore.

  was:
SELECT 1+1;

should just return '2', but instead hive fails because no table is listed.

SELECT 1+1 FROM (empty table);

should also just return '2', but instead hive "succeeds" because there is "no 
possible output," so it produces no output. This is the reason I filed under 
"bug" instead of "improvement:" If it does not return '2' hive should fail.

So, currently we have to run 

SELECT 1+1 FROM (actual table with data);

...which runs a whole mapreduce step to ignore all of the data in the table 
before spitting out '2'. This will be exceptionally frustrating if hadoop makes 
us wait for mapper(s).

It would be nice to be able to SELECT without specifying a table, and perhaps 
relatedly, to be able to edit the above query (which worked) to not need to 
actually process (actual table with data) given that no data is being pulled 
from it (as we see from the SELECT statement, that table's name or alias does 
not appear).


Summary: SELECT without FROM should assume a one-row table with no 
columns.  (was: SELECT without FROM; dropping unnecessary table references)

> SELECT without FROM should assume a one-row table with no columns.
> --
>
> Key: HIVE-178
> URL: https://issues.apache.org/jira/browse/HIVE-178
> Project: Hive
>  Issue Type: Wish
>  Components: Query Processor
>Reporter: Adam Kramer
>Priority: Minor
>  Labels: SQL
>
> SELECT 1+1;
> should just return '2', but instead hive fails because no table is listed.
> SELECT 1+1 FROM (empty table);
> should also just return '2', but instead hive "succeeds" because there is "no 
> possible output," so it produces no output.
> So, currently we have to run 
> SELECT 1+1 FROM (silly one-row dummy table);
> ...which runs a whole mapreduce step to ignore a column of data that is 
> useless anyway. This is much easier due to local mode, but still, it would be 
> nice to be able to SELECT without specifying a table and to get one row of 
> output in moments instead of waiting for even a local-mode job to launch, 
> complete, and return.
> Relatedly, an optimization by which Hive can tell that data from a table 
> isn't even USED would be useful, because it means that the data needn't be 
> queried...the only relevant info from the table would be the number of rows 
> it has, which is available for free from the metastore.





[jira] [Updated] (HIVE-436) MIN and MAX should be generic

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-436:
-

Description: 
MIN and MAX functions currently return the DOUBLE type...but really, these 
should be generic UDFs. It makes sense to talk about minima and maxima for int, 
bigint, double, and even string types.

In some cases like SUM, it's possible that the result would overflow making 
DOUBLE more useful as it can drop digits and swap to scientific notation, but 
MIN and MAX by definition cannot have this problem because the answers are 
always represented in the column they are run across.

Easy workaround: CAST all of my MINs and MAXes from DOUBLE to INT, but these 
should work with STRING too.

  was:
MIN and MAX functions currently return the DOUBLE type...but really, they 
should return the same type as the column they operate on.

In some cases like SUM, it's possible that the result would overflow making 
DOUBLE more useful as it can drop digits and swap to scientific notation, but 
MIN and MAX by definition cannot have this problem because the answers are 
always represented in the column they are run across.

Easy workaround: CAST all of my MINs and MAXes. It's just a wish.

 Issue Type: Improvement  (was: Wish)
Summary: MIN and MAX should be generic  (was: MIN and MAX should 
inherit type)

> MIN and MAX should be generic
> -
>
> Key: HIVE-436
> URL: https://issues.apache.org/jira/browse/HIVE-436
> Project: Hive
>  Issue Type: Improvement
>  Components: UDF
>Reporter: Adam Kramer
>
> MIN and MAX functions currently return the DOUBLE type...but really, these 
> should be generic UDFs. It makes sense to talk about minima and maxima for 
> int, bigint, double, and even string types.
> In some cases like SUM, it's possible that the result would overflow making 
> DOUBLE more useful as it can drop digits and swap to scientific notation, but 
> MIN and MAX by definition cannot have this problem because the answers are 
> always represented in the column they are run across.
> Easy workaround: CAST all of my MINs and MAXes from DOUBLE to INT, but these 
> should work with STRING too.





[jira] [Updated] (HIVE-494) Select columns by index instead of name

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-494:
-

Description: 
SELECT mytable[0], mytable[2] FROM some_table_name mytable;

...should return the first and third columns, respectively, from mytable 
regardless of their column names.

The need for "names" specifically is kind of silly when they just get 
translated into numbers anyway.

  was:
In a very real sense, tables are like arrays or matrices with rows and columns. 
It would be fantastic if I could refer to columns in my select statement by 
their index, rather than by their name.

SELECT mytable[0], mytable[2] FROM some_table_name mytable;

...which would then get the first and third column from mytable. We already 
have syntax like this for array data types, which I think would translate 
nicely: SELECT mytable[0][3], etc.

Or maybe I just spend too much time coding in R...

   Priority: Minor  (was: Major)
Summary: Select columns by index instead of name  (was: Select columns 
by number instead of name)

> Select columns by index instead of name
> ---
>
> Key: HIVE-494
> URL: https://issues.apache.org/jira/browse/HIVE-494
> Project: Hive
>  Issue Type: Wish
>  Components: Clients, Query Processor
>Reporter: Adam Kramer
>Priority: Minor
>  Labels: SQL
>
> SELECT mytable[0], mytable[2] FROM some_table_name mytable;
> ...should return the first and third columns, respectively, from mytable 
> regardless of their column names.
> The need for "names" specifically is kind of silly when they just get 
> translated into numbers anyway.





[jira] [Resolved] (HIVE-602) INSERT LOCAL PIPE '/path/to/program' would be lovely

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer resolved HIVE-602.
--

Resolution: Won't Fix

Workaround sufficient.

> INSERT LOCAL PIPE '/path/to/program' would be lovely
> 
>
> Key: HIVE-602
> URL: https://issues.apache.org/jira/browse/HIVE-602
> Project: Hive
>  Issue Type: New Feature
>  Components: Query Processor, Server Infrastructure
>Reporter: Adam Kramer
>
> INSERT OVERWRITE LOCAL DIRECTORY is great at what it does, but the output is 
> never _instantly_ useful. Why? Because it is a whole bunch of gzipped files.
> It would be lovely if I could tell Hive what it should do with its output 
> when inserting into a local directory. For example, to automatically pipe its 
> output through something. Really, I want to
> gunzip -c *.gz | perl -p -e 's/\cA/\t/g' > filename
> ...but I would settle for piping every reducer's output through gunzip -c | 
> perl -p -e 's/\cA/\t/g' and then having Hive save the result to whatever it 
> would have used for a filename but without the .gz.
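
The post-processing described above (gunzip each part, then turn Ctrl-A delimiters into tabs) can be sketched in a few lines. This is only an illustration of the workaround, assuming the part files have already been read into memory as bytes; `gunzip_and_tabify` and `merge_parts` are hypothetical names, not Hive features.

```python
import gzip

# Ctrl-A (0x01) is the delimiter matched by \cA in the perl one-liner.
CTRL_A = b"\x01"


def gunzip_and_tabify(compressed: bytes) -> bytes:
    """Decompress one gzipped part file and replace Ctrl-A with tabs."""
    return gzip.decompress(compressed).replace(CTRL_A, b"\t")


def merge_parts(parts) -> bytes:
    """Concatenate the converted contents of several gzipped part files."""
    return b"".join(gunzip_and_tabify(part) for part in parts)
```

Reading the `*.gz` files from the local directory and writing the merged bytes to a single output file is left out to keep the sketch short.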





[jira] [Updated] (HIVE-835) Deprecate, remove, or fix MAP and REDUCE syntax.

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-835:
-

Summary: Deprecate, remove, or fix MAP and REDUCE syntax.  (was: Make MAP 
and REDUCE work as expected or add warnings)

> Deprecate, remove, or fix MAP and REDUCE syntax.
> 
>
> Key: HIVE-835
> URL: https://issues.apache.org/jira/browse/HIVE-835
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> There are syntactic elements MAP and REDUCE which function as syntactic sugar 
> for SELECT TRANSFORM. This behavior is not at all intuitive, because no 
> checking or verification is done to ensure that the user's intention is met.
> Specifically, Hive may see a MAP query and simply tack the transform script 
> on to the end of a reduce job (so, the user says MAP but hive does a REDUCE), 
> or (more dangerously) vice-versa. Given that Hive's whole point is to sit on 
> top of a mapreduce framework and allow transformations in the mapper or 
> reducer, it seems very inappropriate for Hive to receive a clear command from 
> the user to MAP or to REDUCE the data using a script and then simply ignore 
> it.
> Better behavior would be for hive to see a MAP command and to start a new 
> mapreduce step and run the command in the mapper (even if it otherwise would 
> be run in the reducer), and for REDUCE to begin a reduce step if necessary 
> (so, tack the REDUCE script on to the end of a REDUCE job if the current 
> system would do so, or if not, treat the 0th column as the reduce key, throw 
> a warning saying this has been done, and force a reduce job).
> Acceptable behavior would be to throw an error or warning when the user's 
> clearly-stated desire is going to be ignored. "Warning: User used MAP 
> keyword, but transformation will occur in the reduce phase" / "Warning: User 
> used REDUCE keyword, but did not specify DISTRIBUTE BY / CLUSTER BY column. 
> Transformation will occur in the map phase."





[jira] [Updated] (HIVE-836) Add syntax to force a new mapreduce job / transform subquery in mapper

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-836:
-

Description: 
Hive currently does a lot of awesome work to figure out when my transformers 
should be used in the mapper and when they should be used in the reducer. 
However, sometimes I have a different plan.

For example, consider this:

{code:title=foo.sql}
SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c on (b.key=c.key)
) a
{code}

...now, assume that the join step is very easy and 'niftyscript' is really 
processor intensive. The ideal format for this is a MR task with few mappers 
and few reducers, and then a second MR task with lots of mappers.

Currently, there is no way to even require that the outer TRANSFORM statement 
occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin 
to /* +MAPJOIN(x) */, would be awesome.

The current workaround is to dump everything to a temporary table and then 
start over, but that is not easy to scale--the subquery structure effectively 
(and easily) "locks" the mid-points so no other job can touch the table.

  was:
Hive currently does a lot of awesome work to figure out when my transformers 
should be used in the mapper and when they should be used in the reducer. 
However, sometimes I have a different plan.

For example, consider this:

SELECT TRANSFORM(a.val1, a.val2)
USING './niftyscript'
AS part1, part2, part3
FROM (
SELECT b.val AS val1, c.val AS val2
FROM tblb b JOIN tblc c on (b.key=c.key)
) a

...in this syntax b and c will be joined (in the reducer, of course), and then 
the rows that pass the join clause will be passed to niftyscript _in the 
reducer._ However, when niftyscript is high-computation and there is a lot of 
data coming out of the join but very few reducers, there's a huge hold-up. It 
would be awesome if I could somehow force a new mapreduce step after the 
subquery, so that ./niftyscript is run in the mappers rather than the prior 
step's reducers.

The current workaround is to dump everything to a temporary table and then 
start over, but that is not easy to scale--the subquery structure effectively 
(and easily) "locks" the mid-points so no other job can touch the table.

SUGGESTED FIX: Either cause MAP and REDUCE to force map/reduce steps (cf. 
https://issues.apache.org/jira/browse/HIVE-835 ), or add a query element to 
specify that "the job ends here." For example, in the above query, FROM a 
SELF-CONTAINED or PRECOMPUTE a or START JOB AFTER a or something like that.



> Add syntax to force a new mapreduce job / transform subquery in mapper
> --
>
> Key: HIVE-836
> URL: https://issues.apache.org/jira/browse/HIVE-836
> Project: Hive
>  Issue Type: Wish
>Reporter: Adam Kramer
>
> Hive currently does a lot of awesome work to figure out when my transformers 
> should be used in the mapper and when they should be used in the reducer. 
> However, sometimes I have a different plan.
> For example, consider this:
> {code:title=foo.sql}
> SELECT TRANSFORM(a.val1, a.val2)
> USING './niftyscript'
> AS part1, part2, part3
> FROM (
> SELECT b.val AS val1, c.val AS val2
> FROM tblb b JOIN tblc c on (b.key=c.key)
> ) a
> {code}
> ...now, assume that the join step is very easy and 'niftyscript' is really 
> processor intensive. The ideal format for this is a MR task with few mappers 
> and few reducers, and then a second MR task with lots of mappers.
> Currently, there is no way to even require that the outer TRANSFORM statement 
> occur in a separate map phase. Implementing a "hint" such as /* +MAP */, akin 
> to /* +MAPJOIN(x) */, would be awesome.
> The current workaround is to dump everything to a temporary table and then 
> start over, but that is not easy to scale--the subquery structure effectively 
> (and easily) "locks" the mid-points so no other job can touch the table.





[jira] [Updated] (HIVE-1251) TRANSFORM should allow piping or allow cross-subquery assumptions.

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-1251:
--

Description: 
Many traditional transforms can be accomplished via simple unix commands 
chained together. For example, the "sort" phase is an instance of "cut -f 1 | 
sort". However, the TRANSFORM command in Hive doesn't allow for unix-style 
piping to occur.

One classic case where I wish there was piping is when I want to "stack" a 
column into several rows:

SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python 
reducer.py' AS key, value

...in this case, stacker.py would produce output of this form:
key col0
key col1
key col2
...and then the reducer would reduce the above down to one item per key. In 
this case, the current workaround is this:

SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
(SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, 
col FROM table) a

...the problem here is that for the above to work (and it should, indeed, work 
in a map-only MR task), I must assume that the data output from one subquery 
will be passed in EXACTLY THE SAME FORMAT to the outer query--i.e., I must 
assume that Hive will not cut a map or reduce phase in between, or "fan out" 
data from the inner query into different mappers in the outer query.

As a user, *I should not be allowed to assume* that data coming out of a 
subquery goes into the nodes for a superquery in the same order...ESPECIALLY in 
the map phase.

  was:
Many traditional transforms can be accomplished via simple unix commands 
chained together. For example, the "sort" phase is an instance of "cut -f 1 | 
sort". However, the TRANSFORM command in Hive doesn't allow for unix-style 
piping to occur.

One classic case where I wish there was piping is when I want to "stack" a 
column into several rows:

SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python 
reducer.py' AS key, value

...in this case, stacker.py would produce output of this form:
key col0
key col1
key col2
...and then the reducer would reduce the above down to one item per key. In 
this case, the current workaround is this:

SELECT TRANSFORM(a.key, a.col) USING 'python reducer.py' AS key, value FROM
(SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py' AS key, 
col FROM table) a

...the problem here is that as a user, *I should not be allowed to assume* that 
the output from the inner query will be passed DIRECTLY to the outer query 
(i.e., the outer query should not assume that it gets the inner query's output 
on the same box and in the same order). I know as a programmer that this works 
fine as a pipe, but when writing Hive code I always wonder--what if Hive 
decides to run the inner query in a reduce step, and the outer query in a 
subsequent map step?

Broadly, my understanding is that the goal of Hive is to abstract the mapreduce 
process away from users. To this end, we have syntax (CLUSTER BY) that allows 
users to assume that a reduce task will occur (but see also 
https://issues.apache.org/jira/browse/HIVE-835 ), but there is no formal way to 
force or syntactically assume that the data will NOT be copied or sorted or 
transformed. I argue that the only case where this would be necessary or 
desirable would be in the instance of a pipe within a transform...ergo a desire 
for | to work as expected.

An alternative would be for the HQL language definition to explicitly state all 
conditions that would cause a task boundary to be crossed (so I can make the 
strong assumption that if none of those conditions obtains, my query will be 
supported in the future)...but that seems potentially restrictive as the 
language and Hadoop evolve.


Summary: TRANSFORM should allow piping or allow cross-subquery 
assumptions.  (was: TRANSFORM should allow pipes in some form)
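
To make the ordering concern in the HIVE-1251 description concrete, here is a minimal sketch of what stacker.py and reducer.py might do. The actual scripts are not shown in the issue, so both functions are assumptions: `stack` emits one "key<TAB>col" line per column, and `reduce_by_key` (hypothetically) joins the values for each run of identical keys with commas.

```python
from itertools import groupby


def stack(lines):
    """Turn 'key<TAB>col0<TAB>col1<TAB>col2' into one 'key<TAB>col' line per column."""
    for line in lines:
        key, *cols = line.rstrip("\n").split("\t")
        for col in cols:
            yield f"{key}\t{col}"


def reduce_by_key(lines):
    """Collapse each run of adjacent lines sharing a key into a single line.

    groupby only merges ADJACENT equal keys, so this is correct only if the
    input arrives in the order the stacker produced it.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key + "\t" + ",".join(value for _, value in group)
```

Fed directly from `stack`, the reducer emits one line per key. If the same stacked lines arrive interleaved (for example because they were fanned out to different mappers), the same key is emitted more than once, which is exactly the assumption the issue says users should not have to make.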

> TRANSFORM should allow piping or allow cross-subquery assumptions.
> --
>
> Key: HIVE-1251
> URL: https://issues.apache.org/jira/browse/HIVE-1251
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> Many traditional transforms can be accomplished via simple unix commands 
> chained together. For example, the "sort" phase is an instance of "cut -f 1 | 
> sort". However, the TRANSFORM command in Hive doesn't allow for unix-style 
> piping to occur.
> One classic case where I wish there was piping is when I want to "stack" a 
> column into several rows:
> SELECT TRANSFORM(key, col0, col1, col2) USING 'python stacker.py | python 
> reducer.py' AS key, value
> ...in this case, stacker.py would produce output of this form:
> key col0
> key col1
> key col2
> ...and then the reducer would reduce the above down to one item per key. In 
> this case, the current workaround is this:

[jira] [Created] (HIVE-2312) Make CLI variables available to UDFs

2011-07-26 Thread Adam Kramer (JIRA)
Make CLI variables available to UDFs


 Key: HIVE-2312
 URL: https://issues.apache.org/jira/browse/HIVE-2312
 Project: Hive
  Issue Type: Improvement
  Components: CLI, Clients, UDF
Reporter: Adam Kramer


Straightforward use case: My UDFs should be able to condition on whether 
hive.mapred.mode=strict or nonstrict.

But these things could also be useful for certain optimizations. For example, a 
UDAF knowing that there is only one reduce phase could avoid a lot of pushing 
data around unnecessarily.





[jira] [Updated] (HIVE-2311) TRANSFORM statements should come with their own ROW FORMATs.

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-2311:
--

  Priority: Minor  (was: Major)
Issue Type: Bug  (was: Improvement)

> TRANSFORM statements should come with their own ROW FORMATs.
> 
>
> Key: HIVE-2311
> URL: https://issues.apache.org/jira/browse/HIVE-2311
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Reporter: Adam Kramer
>Priority: Minor
>
> Sometimes Hive tables contain tabs and/or other characters that could easily 
> be misinterpreted by a transformer as a delimiter. This can break many 
> TRANSFORM queries.
> The solution is to have ROW FORMAT semantics that can be attached to an 
> individual TRANSFORM instance. It would have the same semantics as table 
> creation, but during serialization it would ensure that any formal delimiter 
> characters that did not indicate an actual break between columns would be 
> escaped.
> At the very least, it is a bug that TRANSFORM statement deserialization does 
> not backslash out literal tabs in the current implementation.





[jira] [Created] (HIVE-2311) TRANSFORM statements should come with their own ROW FORMATs.

2011-07-26 Thread Adam Kramer (JIRA)
TRANSFORM statements should come with their own ROW FORMATs.


 Key: HIVE-2311
 URL: https://issues.apache.org/jira/browse/HIVE-2311
 Project: Hive
  Issue Type: Improvement
  Components: Serializers/Deserializers
Reporter: Adam Kramer


Sometimes Hive tables contain tabs and/or other characters that could easily be 
misinterpreted by a transformer as a delimiter. This can break many TRANSFORM 
queries.

The solution is to have ROW FORMAT semantics that can be attached to an 
individual TRANSFORM instance. It would have the same semantics as table 
creation, but during serialization it would ensure that any formal delimiter 
characters that did not indicate an actual break between columns would be 
escaped.

At the very least, it is a bug that TRANSFORM statement deserialization does 
not backslash out literal tabs in the current implementation.
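
The escaping behaviour requested above can be sketched as follows. This is only an illustration of the idea, assuming tab-delimited rows and a backslash escape convention (\t, \n, \\); the function names are hypothetical and do not correspond to Hive's SerDe API.

```python
def escape_field(value: str) -> str:
    """Backslash-escape characters that would otherwise look like delimiters."""
    return (value.replace("\\", "\\\\")   # escape the escape character first
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))


def serialize_row(fields) -> str:
    """Join escaped fields with real tabs, so every tab is a true column break."""
    return "\t".join(escape_field(field) for field in fields)


_UNESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}


def unescape_field(value: str) -> str:
    """Reverse escape_field; unknown escape sequences are left untouched."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def deserialize_row(line: str):
    """Split on real tabs, then restore the original field contents."""
    return [unescape_field(field) for field in line.split("\t")]
```

With this convention a transformer can split each line on tabs without worrying that a tab inside a STRING value will shift the columns.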





[jira] [Updated] (HIVE-1466) Add NULL DEFINED AS to ROW FORMAT specification

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-1466:
--

Description: 
NULL values are passed to transformers as a literal backslash and a literal N. 
NULL values are saved when INSERT OVERWRITing LOCAL DIRECTORies as "NULL". This 
is inconsistent.

The ROW FORMAT specification of tables should be able to specify the manner in 
which a NULL value is represented. ROW FORMAT NULL DEFINED AS '\N' or 
'\003' or whatever should apply to all instances of table export and saving.

  was:
I just updated the Hive wiki to clarify what some would consider an oddity: 
When NULL values are exported to a script via TRANSFORM, they are converted to 
the string "\N", and then when the script's output is read, any cell that 
contains only \N is treated as a NULL value.

I believe that there are very VERY few reasons why anyone would need cells that 
contain only a backslash and then a capital N to be distinguished from NULL 
cells, but for complete generality, we should allow this.

The way to do that is probably by adding a specification in the ROW FORMAT for 
a table that would allow any string to be treated as a NULL if it is the only 
string in a cell. Some may prefer the empty string, others the word NULL in 
caps, etc. I vote for keeping \N as the default because I am used to it, but 
also for allowing this to be customized.


> Add NULL DEFINED AS to ROW FORMAT specification
> ---
>
> Key: HIVE-1466
> URL: https://issues.apache.org/jira/browse/HIVE-1466
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> NULL values are passed to transformers as a literal backslash and a literal 
> N. NULL values are saved when INSERT OVERWRITing LOCAL DIRECTORies as "NULL". 
> This is inconsistent.
> The ROW FORMAT specification of tables should be able to specify the manner 
> in which a NULL value is represented. ROW FORMAT NULL DEFINED AS '\N' or 
> '\003' or whatever should apply to all instances of table export and saving.
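
From the point of view of a transform script, a configurable NULL representation might look like the sketch below. The `null_marker` parameter is a hypothetical stand-in for the proposed NULL DEFINED AS setting; the default of backslash-N matches what the description says transformers currently receive.

```python
DEFAULT_NULL = "\\N"  # a literal backslash followed by a capital N


def decode_row(line: str, null_marker: str = DEFAULT_NULL):
    """Split a tab-delimited line; cells equal to the marker become None."""
    return [None if cell == null_marker else cell
            for cell in line.rstrip("\n").split("\t")]


def encode_row(values, null_marker: str = DEFAULT_NULL) -> str:
    """Join values with tabs, writing None as the configured marker."""
    return "\t".join(null_marker if value is None else str(value)
                     for value in values)
```

Passing `null_marker="NULL"` would read the representation that the description reports for INSERT OVERWRITE LOCAL DIRECTORY output, which is the inconsistency a single table-level setting would remove.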





[jira] [Updated] (HIVE-1955) Support non-constant expressions for array indexes.

2011-07-26 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-1955:
--

Description: 
FAILED: Error in semantic analysis: line 4:8 Non Constant Expressions for Array 
Indexes not Supported dut

...just wrote my own UDF to do this, and it is trivial. We should support this 
natively.

Let foo have these rows:
arr   i
[1,2,3]   1
[3,4,5]   2
[5,4,3]   2
[0,0,1]   0

Then,
SELECT arr[i] FROM foo
should return:
2
5
3
0

Similarly, for the same table,
SELECT 3 IN arr FROM foo
should return:
true
true
true
false

...these use cases are needless limitations of functionality. We shouldn't need 
UDFs to accomplish these goals.

  was:
FAILED: Error in semantic analysis: line 4:8 Non Constant Expressions for Array 
Indexes not Supported dut

...just wrote my own UDF to do this, and it is trivial. We should support this 
natively.


> Support non-constant expressions for array indexes.
> ---
>
> Key: HIVE-1955
> URL: https://issues.apache.org/jira/browse/HIVE-1955
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> FAILED: Error in semantic analysis: line 4:8 Non Constant Expressions for 
> Array Indexes not Supported dut
> ...just wrote my own UDF to do this, and it is trivial. We should support 
> this natively.
> Let foo have these rows:
> arr   i
> [1,2,3]   1
> [3,4,5]   2
> [5,4,3]   2
> [0,0,1]   0
> Then,
> SELECT arr[i] FROM foo
> should return:
> 2
> 5
> 3
> 0
> Similarly, for the same table,
> SELECT 3 IN arr FROM foo
> should return:
> true
> true
> true
> false
> ...these use cases are needless limitations of functionality. We shouldn't 
> need UDFs to accomplish these goals.
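
The requested semantics are easy to state outside of Hive. The sketch below evaluates both expressions over the example table, assuming zero-based array indexes; the helper names are hypothetical. Note that under zero-based indexing the last row, [0,0,1] with i = 0, yields 0.

```python
# The example table: one (arr, i) pair per row.
foo = [
    ([1, 2, 3], 1),
    ([3, 4, 5], 2),
    ([5, 4, 3], 2),
    ([0, 0, 1], 0),
]


def index_each(rows):
    """Per-row equivalent of SELECT arr[i] FROM foo (zero-based)."""
    return [arr[i] for arr, i in rows]


def contains_each(rows, needle):
    """Per-row equivalent of SELECT <needle> IN arr FROM foo."""
    return [needle in arr for arr, _ in rows]


indexed = index_each(foo)          # [2, 5, 3, 0]
has_three = contains_each(foo, 3)  # [True, True, True, False]
```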





[jira] [Commented] (HIVE-2231) Column aliases

2011-07-26 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071412#comment-13071412
 ] 

Adam Kramer commented on HIVE-2231:
---

The use case here is basically providing backwards compatibility. Many existing 
users of a table, and many new users, are using the same table and want to 
refer to it as such; it is the canonical table.

But sometimes the table was originally named with crummy names, and it'd be 
better and cleaner to document and train new people on the appropriate names.

Views eat up the namespace and provide a level of indirection that is not 
always desirable, but here are the two biggest limitations of views:
* SELECT * is not fast. I can't SELECT * on a view and get data immediately in 
the same way that I would by running the same query against the underlying 
table. This is true even when the schemas are exactly the same.
* Partitions are not see-through. I can't use "show partitions" on a view or 
write any automated system based on the view to identify when new partitions 
land, which forces reference to the original table, and then all is lost.



> Column aliases
> --
>
> Key: HIVE-2231
> URL: https://issues.apache.org/jira/browse/HIVE-2231
> Project: Hive
>  Issue Type: Wish
>  Components: Query Processor
>Reporter: Adam Kramer
>Priority: Trivial
>
> It would be nice in several cases to be able to alias column names.
> Say someone in your company CREATEd a TABLE called important_but_named_poorly 
> (alvin BIGINT, theodore BIGINT, simon STRING) PARTITIONED BY (dave STRING), 
> that indexes the relationship between an actor (alvin), a target (theodore), 
> and the interaction between them (simon), partitioned based on the date 
> string (dave). Renaming the columns would break a million pipelines that are 
> important but ownerless.
> It would be awesome to define an aliasing system as such:
> ALTER TABLE important_but_named_poorly REPLACE COLUMNS (actor BIGINT AKA 
> alvin, target BIGINT AKA theodore, ixn STRING AKA simon) PARTITIONED BY (ds 
> STRING AKA dave);
> ...which would mean that any user could, e.g., use the term "dave" to refer 
> to ds if they really wanted to.





[jira] [Commented] (HIVE-934) Rows loaded incorrect; should be suppressed.

2011-07-26 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071408#comment-13071408
 ] 

Adam Kramer commented on HIVE-934:
--

Resolved, right? Feel free to close.

> Rows loaded incorrect; should be suppressed.
> 
>
> Key: HIVE-934
> URL: https://issues.apache.org/jira/browse/HIVE-934
> Project: Hive
>  Issue Type: Bug
>Reporter: Adam Kramer
>Assignee: Ning Zhang
>
> For several queries, Hive reports "rows loaded" at the bottom, e.g.,
> 928955 Rows loaded to akramer_mem_updates2
> ...however, this number is not always correct. "Rows loaded" should be the 
> same as the number of rows in the table after the table is created. In the 
> above case, select count(1) from akramer_mem_updates2 returns 2649223, so 
> the reported 928955 is incorrect.
> This has been noted for a long time; the basic response to reports of this 
> problem is "Yeah, rows loaded is wrong, you should ignore it." If this is so, 
> it should stop being reported by Hive entirely (or it should only be reported 
> in cases where it is correct).
> Or, it could be fixed. :)





[jira] [Created] (HIVE-2295) Implement CLUSTERED BY, DISTRIBUTED BY, SORTED BY directives for a single query level.

2011-07-20 Thread Adam Kramer (JIRA)
Implement CLUSTERED BY, DISTRIBUTED BY, SORTED BY directives for a single query 
level.
--

 Key: HIVE-2295
 URL: https://issues.apache.org/jira/browse/HIVE-2295
 Project: Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Adam Kramer


The common pattern for using the mapreduce framework looks like this:

SELECT TRANSFORM(a.foo, a.bar)
USING 'mapper.py'
AS x, y, z
FROM (
  SELECT b.foo, b.bar
  FROM tablename b
  CLUSTER BY b.foo
) a;

...however, this is exceptionally fragile, as it relies on the assumption that 
Hive is not doing any "magic" in between the query steps. People familiar with 
SQL frequently assume that query steps are effectively separated from each 
other. CLUSTER BY, then, would guarantee that data are clustered on their way 
OUT of the query, but really what we need is a directive to indicate that data 
must be clustered on the way INTO the query.

This is not pedantic, because there is no reason that Hive wouldn't try to 
optimize data flow between queries, for example, systematically splitting up 
big queries. The UDAF framework, with its merging step, would allow a single 
key/value pair to be split across SEVERAL reducers, "violating" the mapreduce 
assumptions but returning the correct data...however, for a TRANSFORM 
statement, no such protections are afforded.

I propose, for greater clarity, that these directives be part of the same query 
level. Example syntax:

SELECT TRANSFORM(foo, bar)
USING 'reducer.py'
AS x, y, z
FROM tablename
CLUSTERED BY foo;

...in other words, move the directive regarding data distribution to the query 
that actually cares about it, allowing for users who are making the assumptions 
of the mapreduce framework to formally indicate that their transformer really 
DOES need clustered data. Or to put it in other words, CLUSTER BY is a 
directive guaranteeing that data are clustered on the way OUT OF a query (i.e., 
for bucketed tables), whereas CLUSTERED BY is a directive guaranteeing that 
data are clustered on the way INTO a query.

Bonus points: For tables that are already CLUSTERED BY in their definition, 
allow this query to run in the map phase.
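
To show what "clustered on the way INTO a query" means for a transformer, here is a minimal sketch of a check a reducer script could run on the keys it receives. `is_clustered` is a hypothetical helper, not part of Hive; it only tests the property that all rows for a given key arrive adjacent to one another.

```python
def is_clustered(keys) -> bool:
    """Return True if every key's rows are contiguous in the input stream."""
    finished = set()       # keys whose run has already started (and maybe ended)
    previous = object()    # sentinel that compares unequal to any real key
    for key in keys:
        if key != previous:
            if key in finished:
                # The key reappeared after a different key: not clustered.
                return False
            finished.add(key)
            previous = key
    return True
```

A reducer written against the mapreduce assumptions silently produces wrong results when this check would fail, which is why the proposal asks for a directive on the query that consumes the data rather than on the subquery that produces it.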





[jira] [Commented] (HIVE-1360) Allow UDFs to access constant parameter values at compile time

2011-07-09 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062645#comment-13062645
 ] 

Adam Kramer commented on HIVE-1360:
---

Glad this is getting some traction! Ideally, the "constancy" here should be 
detectable in the init/initialize state, and, if it IS, the constant should be 
accessible on initialization.

> Allow UDFs to access constant parameter values at compile time
> --
>
> Key: HIVE-1360
> URL: https://issues.apache.org/jira/browse/HIVE-1360
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor, UDF
>Affects Versions: 0.5.0
>Reporter: Carl Steinbach
>Assignee: Carl Steinbach
>
> UDFs should be able to access constant parameter values at compile time.





[jira] [Commented] (HIVE-1731) Improve miscellaneous error messages

2011-06-22 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053480#comment-13053480
 ] 

Adam Kramer commented on HIVE-1731:
---

Syed, can you also link the other task to finish the other messages here? I'll 
start posting my new complaints about error messages there now since this one's 
done. :)

Here's one:

hive> select count(1), b.age from t1 a join t2 b ON a.key = b.key and 
a.ds='2011-06-10' AND b.ds='2011-06-10';
FAILED: Error in semantic analysis: Line 1:17 Expression not in GROUP BY key 
'age'

...this should say, 'b.age' and not just 'age.' Why? Because it's different 
from this:

hive> select count(1), b.age from t1 a join t2 b ON a.key = b.key and 
a.ds='2011-06-10' AND b.ds='2011-06-10' GROUP BY age;
FAILED: Error in semantic analysis: Line 1:17 Expression not in GROUP BY key 
'age'

...which is many times more maddening than the first error message.

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Diagnosability, Query Processor
>Reporter: John Sichi
>Assignee: Syed S. Albiz
> Fix For: 0.7.1, 0.8.0
>
> Attachments: 0001-Fixed-parse-error-line-number-issue.patch, 
> 0001-Fixed-parse-error-line-number-issue.patch, 
> 0002-update-test-cases-with-error-string-results.patch, HIVE-1731.1.patch, 
> HIVE-1731.2.patch, HIVE-1731.3.patch, HIVE-1731.4.patch, HIVE-1731.5.patch, 
> HIVE-1731.6.patch, HIVE-1731.7-0.7.patch, HIVE-1731.7.patch, 
> HIVE-1731.8-0.7.patch
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.





[jira] [Created] (HIVE-2232) Any query to a partition column should access the metastore and not the data

2011-06-21 Thread Adam Kramer (JIRA)
Any query to a partition column should access the metastore and not the data


 Key: HIVE-2232
 URL: https://issues.apache.org/jira/browse/HIVE-2232
 Project: Hive
  Issue Type: Improvement
  Components: Metastore, Query Processor, Server Infrastructure
Reporter: Adam Kramer
Priority: Minor


The metastore contains all of the data on the possible values, etc., for all 
partition columns (including subpartitions). So, any query that actually reads 
or uses data from partition columns should avoid table scans.

For example:

CREATE TABLE t1 (value1 STRING) PARTITIONED BY (ds STRING, key STRING);
CREATE TABLE t2 (key STRING, value2 STRING) PARTITIONED BY (ds STRING);

...

SELECT t2.key, t1.value1, t2.value2 FROM t1 JOIN t2 ON t1.key=t2.key AND 
t1.ds='2010-01-01' AND t2.ds='2010-01-01';

...ideally, the JOIN in this case would operate very very quickly without 
scanning every row of t1--because every value of t1.key is in the metastore 
because it is a partition column. This is just one example. Partition pruning 
is another example that currently works well.





[jira] [Created] (HIVE-2231) Column aliases

2011-06-21 Thread Adam Kramer (JIRA)
Column aliases
--

 Key: HIVE-2231
 URL: https://issues.apache.org/jira/browse/HIVE-2231
 Project: Hive
  Issue Type: Wish
  Components: Query Processor
Reporter: Adam Kramer
Priority: Trivial


It would be nice in several cases to be able to alias column names.

Say someone in your company CREATEd a TABLE called important_but_named_poorly 
(alvin BIGINT, theodore BIGINT, simon STRING) PARTITIONED BY (dave STRING), 
that indexes the relationship between an actor (alvin), a target (theodore), 
and the interaction between them (simon), partitioned based on the date string 
(dave). Renaming the columns would break a million pipelines that are important 
but ownerless.

It would be awesome to define an aliasing system as such:

ALTER TABLE important_but_named_poorly REPLACE COLUMNS (actor BIGINT AKA alvin, 
target BIGINT AKA theodore, ixn STRING AKA simon) PARTITIONED BY (ds STRING AKA 
dave);

...which would mean that any user could, e.g., use the term "dave" to refer to 
ds if they really wanted to.





[jira] [Commented] (HIVE-1731) Improve miscellaneous error messages

2011-04-14 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020040#comment-13020040
 ] 

Adam Kramer commented on HIVE-1731:
---

SELECT a.id, VAR(a.cnt) FROM mytable a
FAILED: Error in semantic analysis: line 1:94 Expression Not In Group By Key a

...what this error message should say is "Function VAR is either undefined or 
it is not an aggregation function."
...and even if this were a GOOD message, it means "expression not in group by 
key a.cnt"...just saying a is useless.

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Diagnosability, Query Processor
>Reporter: John Sichi
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.



[jira] [Commented] (HIVE-306) Support "INSERT [INTO] destination"

2011-04-08 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017586#comment-13017586
 ] 

Adam Kramer commented on HIVE-306:
--

Ping?

> Support "INSERT [INTO] destination"
> ---
>
> Key: HIVE-306
> URL: https://issues.apache.org/jira/browse/HIVE-306
> Project: Hive
>  Issue Type: New Feature
>Reporter: Zheng Shao
>
> Currently hive only supports "INSERT OVERWRITE destination". We should 
> support "INSERT [INTO] destination".



[jira] [Commented] (HIVE-1263) Is it time for INSERT APPEND?

2011-04-08 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017584#comment-13017584
 ] 

Adam Kramer commented on HIVE-1263:
---

Closing in favor of HIVE-306

> Is it time for INSERT APPEND?
> -
>
> Key: HIVE-1263
> URL: https://issues.apache.org/jira/browse/HIVE-1263
> Project: Hive
>  Issue Type: New Feature
>Reporter: Adam Kramer
>
> It seems to me that for someone who did not want to wipe an existing table 
> but just wanted to add rows to it, an INSERT APPEND may be desirable. I 
> would expect that the mechanism by which this would take place is the whole 
> normal mapreduce step as if an OVERWRITE were taking place, but the end 
> result would be adding files to the hdfs directory rather than wiping the 
> directory and replacing it with a new one.
> Or failing that, INSERT APPEND could be syntactic sugar for an INSERT 
> OVERWRITE on the same table, but with the SELECT statement replaced with the 
> user-specified SELECT and UNION ALL'd on select * for the existing 
> table...which is my current workaround, but seems like an inelegant query.



[jira] [Resolved] (HIVE-1263) Is it time for INSERT APPEND?

2011-04-08 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer resolved HIVE-1263.
---

Resolution: Duplicate

Duplicate of HIVE-306.

> Is it time for INSERT APPEND?
> -
>
> Key: HIVE-1263
> URL: https://issues.apache.org/jira/browse/HIVE-1263
> Project: Hive
>  Issue Type: New Feature
>Reporter: Adam Kramer
>
> It seems to me that for someone who did not want to wipe an existing table 
> but just wanted to add rows to it, an INSERT APPEND may be desirable. I 
> would expect that the mechanism by which this would take place is the whole 
> normal mapreduce step as if an OVERWRITE were taking place, but the end 
> result would be adding files to the hdfs directory rather than wiping the 
> directory and replacing it with a new one.
> Or failing that, INSERT APPEND could be syntactic sugar for an INSERT 
> OVERWRITE on the same table, but with the SELECT statement replaced with the 
> user-specified SELECT and UNION ALL'd on select * for the existing 
> table...which is my current workaround, but seems like an inelegant query.



[jira] [Commented] (HIVE-446) Implement TRUNCATE

2011-04-08 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017580#comment-13017580
 ] 

Adam Kramer commented on HIVE-446:
--

One year later, I still want this feature.

Here's a spec:

TRUNCATE TABLE tablename;
...would delete all the data in it (simple hadoop dfs -rmr), OR drop all of the 
partitions if it is a partitioned table (less simple).

TRUNCATE TABLE tablename PARTITION(ds='2011-01-01') would delete all the data 
in the partition.

> Implement TRUNCATE
> --
>
> Key: HIVE-446
> URL: https://issues.apache.org/jira/browse/HIVE-446
> Project: Hive
>  Issue Type: New Feature
>  Components: Query Processor
>Reporter: Prasad Chakka
>
> truncate the data but leave the table and metadata intact.



[jira] [Commented] (HIVE-1731) Improve miscellaneous error messages

2011-04-04 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015646#comment-13015646
 ] 

Adam Kramer commented on HIVE-1731:
---

hive> alter table my_table create partition(ds='2011-08-01');
FAILED: Parse Error: line 1:12 cannot recognize input 'my_table' in alter table 
statement

...what it should say is, "I don't know what create partition means." If I 
change the word "create" to "add," this works.

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Diagnosability, Query Processor
>Reporter: John Sichi
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.



[jira] [Commented] (HIVE-1199) configure total number of mappers

2011-03-25 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011302#comment-13011302
 ] 

Adam Kramer commented on HIVE-1199:
---

+1. This is also a bigger issue for automation of jobs that require tweaking 
the amount of resources. I have a job right now that needs about 10x the number 
of mappers to run smoothly, and I would like to pipeline it, but the data size 
is growing...so if I configure the split sizes, I need to do so based on 
today's size of the table. That should be handled by Hive.

Ideally, this would mean that the split.sizes are generated or recomputed 
dynamically. One variable, mapred.map.tasks.approx, could be set or 
unset...then Hive could do some quick math based on the size of the table and 
dynamically set its own mapred.max.split.size and min.split.size to get 
approximately the desired number of mappers. Doesn't have to be perfect in 
order to be useful!
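
A rough sketch of the "quick math" suggested above, in Python purely for illustration. The function name approx_split_size and the example table size are made up for this sketch, and mapred.map.tasks.approx is only the setting proposed in this comment, not an existing option. The real mapper count would also depend on file layout and input format, so the result is approximate at best.

```python
def approx_split_size(table_bytes, desired_mappers):
    """Return a split size in bytes that yields roughly desired_mappers splits."""
    if desired_mappers < 1:
        raise ValueError("desired_mappers must be at least 1")
    # Integer ceiling division; never return 0, even for an empty table.
    return max(1, (table_bytes + desired_mappers - 1) // desired_mappers)


# Hypothetical example: a 50 GiB table and a target of about 400 mappers.
size = approx_split_size(50 * 1024 ** 3, 400)
print("SET mapred.max.split.size={};".format(size))
```

Because the size is recomputed from the table's current byte count each time, a pipelined job would not need a hand-tuned split size as the table grows.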

> configure total number of mappers
> -
>
> Key: HIVE-1199
> URL: https://issues.apache.org/jira/browse/HIVE-1199
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: Namit Jain
>
> For users, it might be very difficult to control the number of mappers. There 
> are many parameters that confuse users - 
> for CombineHiveInputFormat, a different set of parameters is required to 
> control the number of mappers.
> In general, users should have a way to specify the total number of mappers, 
> which should be obeyed. This will be very difficult
> to guarantee, since the query might be reading from a large number of 
> partitions, where a mapper can only span one partition.
> What if the number of mappers that the user wants is less than the total 
> number of partitions ?
> It would be a useful heuristic to have - a simple usecase that Joy had is as 
> follows:
> A query needs to be run on one table, which has a lot of small files - it 
> will be easy for him to specify the total number of mappers
> rather than the various rack-local/node-local CombineFileInputFormat 
> parameters.



[jira] [Commented] (HIVE-1920) DESCRIBE with comments is difficult to read

2011-03-23 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010514#comment-13010514
 ] 

Adam Kramer commented on HIVE-1920:
---

Default SHOULD show comments, but there needs to be an option to NOT show 
comments...

> DESCRIBE with comments is difficult to read
> ---
>
> Key: HIVE-1920
> URL: https://issues.apache.org/jira/browse/HIVE-1920
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI
>Affects Versions: 0.7.0
>Reporter: Paul Yang
>Priority: Minor
> Attachments: HIVE-1920.1.nocomment.patch
>
>
> When DESCRIBE is run, comments for columns are displayed next to the column 
> type. A problem with this is that if the comment contains line breaks, it is 
> difficult to differentiate the columns from the comments and is difficult to 
> read.



[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2011-03-09 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004917#comment-13004917
 ] 

Adam Kramer commented on HIVE-1731:
---

FAILED - LOCKS ON THE UNDERLYING OBJECTS CANNOT BE ACQUIRED. RETRY AFTER SOME 
TIME

should be

FAILED - Somebody is currently running a query that will replace this table or 
partition. Retry after some time.

or

FAILED - Somebody is currently reading from a table or partition you are 
attempting to overwrite. Retry after some time.

...depending on whether the user is trying to read or write, respectively. 

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Diagnosability, Query Processor
>Reporter: John Sichi
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.



[jira] Commented: (HIVE-1490) More implicit type conversion: UNION ALL and COALESCE

2011-03-02 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001903#comment-13001903
 ] 

Adam Kramer commented on HIVE-1490:
---

Also present in CASE statements:

FAILED: Error in semantic analysis: line 37:57 Argument Type Mismatch '$.foo': 
The expressions after THEN should have the same type: "int" is expected but 
"string" is found


> More implicit type conversion: UNION ALL and COALESCE
> -
>
> Key: HIVE-1490
> URL: https://issues.apache.org/jira/browse/HIVE-1490
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, Server Infrastructure
>Reporter: Adam Kramer
>
> This is a usecase that frequently annoys me:
> SELECT TRANSFORM(stuff)
> USING 'script'
> AS thing1, thing2
> FROM some_table
> UNION ALL
> SELECT a.thing1, a.thing2
> FROM some_other_table a
> ...this fails when a.thing1 and a.thing2 are anything but STRING, because all 
> output of TRANSFORM is STRING.
> In this case, a.thing1 and a.thing2 should be implicitly converted to string.
> COALESCE(a.thing1, a.thing2, a.thing3) should similarly do implicit type 
> conversion among the arguments. If two are INT and one is BIGINT, upgrade the 
> INTs, etc.
> At the very least, it would be nice to have syntax like
> SELECT TRANSFORM(stuff)
> USING 'script'
> AS thing1 INT, thing2 INT
> ...which would effectively cast the output column to the specified type. But 
> really, type conversion should work.
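
A small sketch of the "upgrade the INTs" rule from the description, in Python for illustration only. The linear widening order below is an assumption made for this example and is not Hive's actual conversion table (for instance, BIGINT to FLOAT can lose precision); common_type is a hypothetical helper name.

```python
# Assumed widening order, narrowest to widest. This is a simplification
# for illustration, not Hive's real implicit-conversion matrix.
WIDENING_ORDER = ["TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE", "STRING"]


def common_type(*types):
    """Return the widest of the given type names under WIDENING_ORDER."""
    if not types:
        raise ValueError("need at least one type")
    ranks = [WIDENING_ORDER.index(t.upper()) for t in types]
    return WIDENING_ORDER[max(ranks)]


# COALESCE(a.thing1, a.thing2, a.thing3) with two INTs and one BIGINT:
print(common_type("INT", "INT", "BIGINT"))
# UNION ALL of a TRANSFORM output column (STRING) with an INT column:
print(common_type("STRING", "INT"))
```

Under this assumed ordering the COALESCE case resolves to BIGINT and the UNION ALL case to STRING, which is the behaviour the description asks for.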





[jira] Commented: (HIVE-138) Provide option to export a HEADER

2011-02-28 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000531#comment-13000531
 ] 

Adam Kramer commented on HIVE-138:
--

The syntax specified here doesn't work...can you update the Hive wiki to 
reflect how to get a header?

> Provide option to export a HEADER
> -
>
> Key: HIVE-138
> URL: https://issues.apache.org/jira/browse/HIVE-138
> Project: Hive
>  Issue Type: Improvement
>  Components: Clients, Query Processor
>Reporter: Adam Kramer
>Assignee: Paul Butler
>Priority: Minor
> Attachments: HIVE-138.patch
>
>
> When writing data to directories or files for later analysis, or when 
> exploring data in the hive CLI with raw SELECT statements, it'd be great if 
> we could get a "header" or something so we know which columns our output 
> comes from. Any chance this is easy to add? Just print the column names (or 
> formula used to generate them) in the first row?
> SELECT foo.* WITH HEADER FROM some_table foo limit 3;
> col1    col2    col3
> 1   9   6
> 7   5   0
> 7   5   3
> SELECT f.col1-f.col2, col3 WITH HEADER FROM some_table f limit 3;
> f.col1-f.col2 col3
> -8 6
> 2 0
> 2 3
> ...etc





[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2011-02-25 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999479#comment-12999479
 ] 

Adam Kramer commented on HIVE-1731:
---

@John: Which of my messages are you talking about? Also, what's critical mass 
on this jira?

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.





[jira] Commented: (HIVE-1994) Support new annotation @UDFType(stateful = true)

2011-02-14 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994581#comment-12994581
 ] 

Adam Kramer commented on HIVE-1994:
---

Agree; also consider deprecating DISTRIBUTE/SORT/CLUSTER BY in favor of 
DISTRIBUTED/SORTED/CLUSTERED BY, a syntax that would explicitly prevent 
short-circuiting and subdivision for only the query it's a part of.

I can't imagine that "sort by in the subquery leads to assumptions in the 
parent query" scales well or will last long in any case, but this functionality 
is not only necessary for backwards-compatibility, but is also kind of the 
entire reason Hive was developed and/or conceived: To utilize mapreduce 
functionality in order to transform and process data. Preventing the querier 
from making mapreduce assumptions just seems sad.

> Support new annotation @UDFType(stateful = true)
> 
>
> Key: HIVE-1994
> URL: https://issues.apache.org/jira/browse/HIVE-1994
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor, UDF
>Reporter: John Sichi
>Assignee: John Sichi
>
> Because Hive does not yet support window functions from SQL/OLAP, people have 
> started hacking around it by writing stateful UDFs for things like 
> cumulative sum.  An example is row_sequence in contrib.
> To clearly mark these, I think we should add a new annotation (with separate 
> semantics from the existing deterministic annotation).  I'm proposing the 
> name stateful for lack of a better idea, but I'm open to suggestions.
> The semantics are as follows:
> * A stateful UDF can only be used in the SELECT list, not in other clauses 
> such as WHERE/ON/ORDER/GROUP
> * When a stateful UDF is present in a query, there's an implication that its 
> SELECT needs to be treated as similar to TRANSFORM, i.e. when there's 
> DISTRIBUTE/CLUSTER/SORT clause, then run inside the corresponding reducer to 
> make sure that the results are as expected.
> For the first one, an example of why we need this is AND/OR short-circuiting; 
> we don't want these optimizations to cause the invocation to be skipped in a 
> confusing way, so we should just ban it outright (which is what SQL/OLAP does 
> for window functions).
> For the second one, I'm not entirely certain about the details since some of 
> it is lost in the mists in Hive prehistory, but at least if we have the 
> annotation, we'll be able to preserve backwards compatibility as we start 
> adding new cost-based optimizations which might otherwise break it.  A 
> specific example would be inserting a materialization step (e.g. for global 
> query optimization) in between the DISTRIBUTE/CLUSTER/SORT and the outer 
> SELECT containing the stateful UDF invocation; this could be a problem if the 
> mappers in the second job subdivide the buckets generated by the first job.  
> So we wouldn't do anything immediately, but the presence of the annotation 
> will help us going forward.





[jira] Resolved: (HIVE-1334) Add PERCENTILE for continuous (double) distributions

2011-02-07 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer resolved HIVE-1334.
---

Resolution: Fixed

PERCENTILE_APPROX was added ages ago.

> Add PERCENTILE for continuous (double) distributions
> 
>
> Key: HIVE-1334
> URL: https://issues.apache.org/jira/browse/HIVE-1334
> Project: Hive
>  Issue Type: New Feature
>Reporter: Adam Kramer
>Priority: Minor
>
> As with the fresh-off-the-presses 
> https://issues.apache.org/jira/browse/HIVE-259 ...but for double 
> distributions.
> Oracle spec is at 
> http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions110.htm
>  for this. I don't think it should be much more trouble than the first 
> version with simple linear interpolation.
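
For reference, a minimal sketch of a continuous percentile computed by linear interpolation between neighbouring sorted values, which is how the linked Oracle PERCENTILE_CONT spec defines it. This is Python for illustration only; percentile_cont is a hypothetical helper, and it is not how PERCENTILE_APPROX is implemented in Hive.

```python
import math


def percentile_cont(values, p):
    """Continuous percentile of values for 0 <= p <= 1, by linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    # Fractional 0-based rank of the requested percentile.
    pos = p * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return float(ordered[lo])
    # Interpolate between the two neighbouring values.
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


print(percentile_cont([1, 2, 3, 4], 0.5))
```

With an even number of rows the median falls between the two middle values, so the example above gives 2.5 rather than an existing row value.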





[jira] Updated: (HIVE-1490) More implicit type conversion: UNION ALL and COALESCE

2011-02-07 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer updated HIVE-1490:
--

Issue Type: Bug  (was: Improvement)

Based on 
http://wiki.apache.org/hadoop/Hive/HiveQL/Types#Implicit_and_Explicit_Type_Conversions
 this appears to actually be a bug rather than an improvement.

> More implicit type conversion: UNION ALL and COALESCE
> -
>
> Key: HIVE-1490
> URL: https://issues.apache.org/jira/browse/HIVE-1490
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, Server Infrastructure
>Reporter: Adam Kramer
>
> This is a usecase that frequently annoys me:
> SELECT TRANSFORM(stuff)
> USING 'script'
> AS thing1, thing2
> FROM some_table
> UNION ALL
> SELECT a.thing1, a.thing2
> FROM some_other_table a
> ...this fails when a.thing1 and a.thing2 are anything but STRING, because all 
> output of TRANSFORM is STRING.
> In this case, a.thing1 and a.thing2 should be implicitly converted to string.
> COALESCE(a.thing1, a.thing2, a.thing3) should similarly do implicit type 
> conversion among the arguments. If two are INT and one is BIGINT, upgrade the 
> INTs, etc.
> At the very least, it would be nice to have syntax like
> SELECT TRANSFORM(stuff)
> USING 'script'
> AS thing1 INT, thing2 INT
> ...which would effectively cast the output column to the specified type. But 
> really, type conversion should work.





[jira] Commented: (HIVE-1955) Support non-constant expressions for array indexes.

2011-02-07 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991708#comment-12991708
 ] 

Adam Kramer commented on HIVE-1955:
---

Also, IN should operate on non-constant arrays.

> Support non-constant expressions for array indexes.
> ---
>
> Key: HIVE-1955
> URL: https://issues.apache.org/jira/browse/HIVE-1955
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> FAILED: Error in semantic analysis: line 4:8 Non Constant Expressions for 
> Array Indexes not Supported dut
> ...just wrote my own UDF to do this, and it is trivial. We should support 
> this natively.





[jira] Commented: (HIVE-478) Surface "processor time" for queries

2011-02-07 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991686#comment-12991686
 ] 

Adam Kramer commented on HIVE-478:
--

Sorry for the month-long delay. This is all I need. But it would be great, in 
general, to report this in the standard way:

Time taken: 185.89 seconds (23,194,570 CPU_MILLISECONDS)

...in the CLI. Otherwise ok to mark issue resolved.
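
A tiny sketch of the line format proposed above, in Python for illustration. The helper name format_time_taken and the split between mapper and reducer CPU time are assumptions made for this example; only the output format comes from the comment.

```python
def format_time_taken(wall_seconds, cpu_millis):
    """Format wall-clock time and total CPU milliseconds in the proposed style."""
    return "Time taken: {:.2f} seconds ({:,} CPU_MILLISECONDS)".format(
        wall_seconds, cpu_millis
    )


# Made-up per-phase figures that add up to the total quoted in the comment.
mapper_cpu_ms = 20000000
reducer_cpu_ms = 3194570
print(format_time_taken(185.89, mapper_cpu_ms + reducer_cpu_ms))
```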

> Surface "processor time" for queries
> 
>
> Key: HIVE-478
> URL: https://issues.apache.org/jira/browse/HIVE-478
> Project: Hive
>  Issue Type: Wish
>  Components: Logging, Query Processor
>Reporter: Adam Kramer
>
> We currently list real-time metrics of how long queries take--"finished in: 
> 1min 13sec" appears on the job tracker. However, this is affected by a lot 
> more than just the quality or implementation of the query. For example, 
> number of mappers used varies a lot when you use subqueries versus 
> single-query aggregation, as does the amount of work necessary.
> For implementation comparisons (e.g., "should I use this version of the query 
> or that one"), it would be great to know the processor time used instead of 
> the real time used...both in terms of "mapper cpu seconds" and "reducer cpu 
> seconds."





[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2011-02-04 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990742#comment-12990742
 ] 

Adam Kramer commented on HIVE-1731:
---

hive> show partitios table_name_here ;
FAILED: Parse Error: line 1:5 rule kwRole failed predicate: 
{input.LT(1).getText().equalsIgnoreCase("role")}? in show role grants

...this is a typo. I meant "partitions." This error message did not help me 
understand that at all and provides misleading information.

Should be

FAILED: Parse Error: line 1:5 predicate "partitios" not understood; failed rule 
kwRole...

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
> Fix For: 0.7.0
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.





[jira] Created: (HIVE-1955) Support non-constant expressions for array indexes.

2011-02-03 Thread Adam Kramer (JIRA)
Support non-constant expressions for array indexes.
---

 Key: HIVE-1955
 URL: https://issues.apache.org/jira/browse/HIVE-1955
 Project: Hive
  Issue Type: Improvement
Reporter: Adam Kramer


FAILED: Error in semantic analysis: line 4:8 Non Constant Expressions for Array 
Indexes not Supported dut

...just wrote my own UDF to do this, and it is trivial. We should support this 
natively.





[jira] Commented: (HIVE-1784) Ctrl+c should kill currently running query, but not exit the CLI

2011-01-26 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987412#action_12987412
 ] 

Adam Kramer commented on HIVE-1784:
---

Duplicate of https://issues.apache.org/jira/browse/HIVE-243 ...been on the 
to-do list since early 2009.

> Ctrl+c should kill currently running query, but not exit the CLI
> 
>
> Key: HIVE-1784
> URL: https://issues.apache.org/jira/browse/HIVE-1784
> Project: Hive
>  Issue Type: Improvement
>  Components: CLI
>Affects Versions: 0.7.0
>Reporter: Paul Yang
>Priority: Minor
>
> When a query is running and Ctrl+C is pressed, the query is killed and the 
> CLI is exited. Instead, Ctrl+c should kill the query but return the user to 
> the Hive prompt. This will make it easier to modify and re-submit the query.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HIVE-1919) Allow ARRAY_CONTAINS to be used as join key

2011-01-18 Thread Adam Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kramer resolved HIVE-1919.
---

Resolution: Won't Fix

Just realized that this is not in fact an equi-join...and there is no way to 
actually delete issues here. Hmph.

> Allow ARRAY_CONTAINS to be used as join key
> ---
>
> Key: HIVE-1919
> URL: https://issues.apache.org/jira/browse/HIVE-1919
> Project: Hive
>  Issue Type: Improvement
>Reporter: Adam Kramer
>
> Using ARRAY_CONTAINS(b.haystack, a.needle) fails with the "Both Left and 
> Right Aliases Encountered in Join" error. But, it doesn't have to...it is 
> effectively an equi-join, just more complicated. The appropriate 
> functionality can be attained using EXPLODE and GROUP BY, but that is 
> ridiculous in terms of amount of work generated for the cluster.




[jira] Created: (HIVE-1919) Allow ARRAY_CONTAINS to be used as join key

2011-01-18 Thread Adam Kramer (JIRA)
Allow ARRAY_CONTAINS to be used as join key
---

 Key: HIVE-1919
 URL: https://issues.apache.org/jira/browse/HIVE-1919
 Project: Hive
  Issue Type: Improvement
Reporter: Adam Kramer


Using ARRAY_CONTAINS(b.haystack, a.needle) fails with the "Both Left and Right 
Aliases Encountered in Join" error. But, it doesn't have to...it is effectively 
an equi-join, just more complicated. The appropriate functionality can be 
attained using EXPLODE and GROUP BY, but that is ridiculous in terms of amount 
of work generated for the cluster.
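
To illustrate the EXPLODE-style rewrite mentioned above, here is a small Python sketch over in-memory rows. The row shapes and the name contains_join are invented for this example; it only mirrors the idea of expanding each haystack into one entry per element and then matching needles by equality.

```python
from collections import defaultdict


def contains_join(needles, haystacks):
    """Join (a_id, needle) rows to (b_id, values) rows where needle is in values.

    Returns sorted (a_id, b_id) pairs.
    """
    # "Explode" step: map each haystack element to the ids of rows containing it.
    # Using a set drops repeated elements within one haystack.
    index = defaultdict(set)
    for b_id, values in haystacks:
        for value in values:
            index[value].add(b_id)
    # Equality lookup of each needle against the exploded elements.
    return sorted(
        (a_id, b_id)
        for a_id, needle in needles
        for b_id in index.get(needle, ())
    )


needles = [(1, "x"), (2, "z")]
haystacks = [(10, ["x", "y", "x"]), (11, ["x"])]
print(contains_join(needles, haystacks))
```

The exploded index has one entry per array element rather than one per row, which is where the extra work for the cluster comes from.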




[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2011-01-16 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982419#action_12982419
 ] 

Adam Kramer commented on HIVE-1731:
---

FAILED: Error in semantic analysis: AS clause has an invalid number of aliases

...this should provide the line number and column number on which the invalid 
AS clause begins and ends. Subqueries mean there could be more than one.

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
> Fix For: 0.7.0
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.




[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2010-12-16 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972300#action_12972300
 ] 

Adam Kramer commented on HIVE-1731:
---

hive> describe table_that_does_not_exist;
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask

...this error message should say something like "Table does not exist."

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
> Fix For: 0.7.0
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.




[jira] Commented: (HIVE-934) Rows loaded incorrect; should be suppressed.

2010-12-14 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971438#action_12971438
 ] 

Adam Kramer commented on HIVE-934:
--

Another option, provided by Greg, would be to actually disclose where this 
number comes from. If it's an estimate based on something, have the message 
read "estimated X lines based on Y."

> Rows loaded incorrect; should be suppressed.
> 
>
> Key: HIVE-934
> URL: https://issues.apache.org/jira/browse/HIVE-934
> Project: Hive
>  Issue Type: Bug
>Reporter: Adam Kramer
>
> For several queries, Hive reports "rows loaded" at the bottom, e.g.,
> 928955 Rows loaded to akramer_mem_updates2
> ...however, this number is not always correct. "Rows loaded" should be the 
> same as the number of rows in the table after the table is created. In the 
> above case, select count(1) from akramer_mem_updates2 returns 2649223; this 
> is incorrect.
> This has been noted for a long time; the basic response to reports of this 
> problem is "Yeah, rows loaded is wrong, you should ignore it." If this is so, 
> it should stop being reported by Hive entirely (or it should only be reported 
> in cases where it is correct).
> Or, it could be fixed. :)




[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2010-12-13 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970972#action_12970972
 ] 

Adam Kramer commented on HIVE-1731:
---

FAILED: Parse Error: line 0:-1 mismatched input '' expecting ) in subquery 
source

No error that refers to lines 0:-1 is ever useful. Here is an example query:

INSERT OVERWRITE TABLE my_table
SELECT TRANSFORM(b.user1, b.user2, b.cnt)
USING '{tr}'
AS c1,c2,c3,c4
FROM (
SELECT b.user1, b.user2, b.cnt FROM (
SELECT user1, user2, COUNT(1) AS cnt
FROM sourcetable
WHERE ds > '2010-12-01' AND ds <= '2010-12-07'
GROUP BY user1, user2
DISTRIBUTE BY user1 SORT BY user1, cnt
) b;

...the problem here is that the inner query is not indented and lacks a ). This 
error message should report the error as being at the ;, which is to say 12:4. 
Since the message knows an rparen is missing, it should also provide the index 
of the lparen. It should read like this:

FAILED: Parse Error: line 12:4 mismatched input '' expecting ) in subquery 
source to close ( at line 5:6.
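
The positions proposed above can be recovered with an ordinary parenthesis 
stack. The following is a minimal Python sketch, not Hive's parser: 
`find_unclosed_parens` and `QUERY` are hypothetical names, lines and columns 
are counted from 1, and only single-quoted string literals are skipped 
(comments and escaped quotes are not handled).

```python
# Illustrative only: a stack-based scan, not how Hive's parser works.
QUERY = """\
INSERT OVERWRITE TABLE my_table
SELECT TRANSFORM(b.user1, b.user2, b.cnt)
USING '{tr}'
AS c1,c2,c3,c4
FROM (
SELECT b.user1, b.user2, b.cnt FROM (
SELECT user1, user2, COUNT(1) AS cnt
FROM sourcetable
WHERE ds > '2010-12-01' AND ds <= '2010-12-07'
GROUP BY user1, user2
DISTRIBUTE BY user1 SORT BY user1, cnt
) b;
"""


def find_unclosed_parens(query):
    """Return 1-indexed (line, column) pairs for every '(' left unclosed."""
    open_parens = []    # stack of positions of '(' still waiting for ')'
    in_string = False   # True while inside a single-quoted literal
    for line_no, line in enumerate(query.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == "'":
                in_string = not in_string
            elif in_string:
                continue
            elif ch == "(":
                open_parens.append((line_no, col_no))
            elif ch == ")" and open_parens:
                open_parens.pop()
    return open_parens


for line_no, col_no in find_unclosed_parens(QUERY):
    print(f"unclosed ( at line {line_no}:{col_no}")
```

For the example query this reports the ( at 5:6, the same position the 
suggested message cites.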

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
> Fix For: 0.7.0
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.




[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2010-12-09 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969995#action_12969995
 ] 

Adam Kramer commented on HIVE-1731:
---

https://issues.apache.org/jira/browse/HIVE-1839 also.

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
> Fix For: 0.7.0
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.




[jira] Created: (HIVE-1839) Error message for "Both Left and Right Aliases Encountered in Join time" cites wrong row/col

2010-12-07 Thread Adam Kramer (JIRA)
Error message for "Both Left and Right Aliases Encountered in Join time" cites 
wrong row/col


 Key: HIVE-1839
 URL: https://issues.apache.org/jira/browse/HIVE-1839
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Adam Kramer


In all cases of the above error, the error message looks like this:

FAILED: Error in semantic analysis: line 0:-1 Both Left and Right Aliases 
Encountered in Join time

...the 0:-1 is incorrect. This should provide the row and the column number.

Ideally, it would also name the left and right aliases so that the user can 
identify which alias is encountered where, since this is rarely obvious.




[jira] Commented: (HIVE-1731) Improve miscellaneous error messages

2010-10-26 Thread Adam Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925098#action_12925098
 ] 

Adam Kramer commented on HIVE-1731:
---

From a UNION ALL query:

FAILED: Error in semantic analysis: Schema of both sides of union should match: 
destinationid:_col1 _col2

...this should 1) provide a line number where the error is, 2) say how the 
schemata mismatch, and 3) use actual column names. destinationid is an actual 
column name, but I have no idea what _col1 and _col2 refer to.

When I have 10 UNION ALLs on top of each other, this error message is very 
aggravating.
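
As a rough illustration of point 2, a message could list exactly where the two 
sides differ. The Python sketch below is not Hive code: 
`describe_union_mismatch` and the example column names and types are 
hypothetical, and it assumes each side of the union is available as an ordered 
list of (name, type) pairs.

```python
from itertools import zip_longest


def describe_union_mismatch(left, right):
    """Compare two ordered schemas and list every positional difference.

    Each schema is a list of (column_name, column_type) tuples.
    Returns a list of readable strings; an empty list means they match.
    """
    problems = []
    if len(left) != len(right):
        problems.append(
            f"column count differs: left has {len(left)}, right has {len(right)}"
        )
    for pos, (l_col, r_col) in enumerate(zip_longest(left, right), start=1):
        if l_col is None or r_col is None:
            continue  # already covered by the column-count message
        if l_col != r_col:
            problems.append(
                f"column {pos}: left is {l_col[0]} {l_col[1]}, "
                f"right is {r_col[0]} {r_col[1]}"
            )
    return problems


# Hypothetical schemas for the two sides of a UNION ALL.
left_side = [("destinationid", "BIGINT"), ("cnt", "BIGINT")]
right_side = [("destinationid", "STRING"), ("cnt", "BIGINT")]

for problem in describe_union_mismatch(left_side, right_side):
    print(problem)
```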

> Improve miscellaneous error messages
> 
>
> Key: HIVE-1731
> URL: https://issues.apache.org/jira/browse/HIVE-1731
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: John Sichi
> Fix For: 0.7.0
>
>
> This is a place for accumulating error message improvements so that we can 
> update a bunch in batch.




[jira] Created: (HIVE-1703) BOOL types should implicitly cast to INT

2010-10-12 Thread Adam Kramer (JIRA)
BOOL types should implicitly cast to INT


 Key: HIVE-1703
 URL: https://issues.apache.org/jira/browse/HIVE-1703
 Project: Hadoop Hive
  Issue Type: Improvement
  Components: Query Processor
Reporter: Adam Kramer
Priority: Minor


From the Wiki:
"Otherwise, the operator is probably a UDF/UDAF function. In that case, we will 
try to convert the parameters to the types that are accepted by the UDF/UDAF 
function. If the UDF/UDAF function is overloaded (with more than 1 
implementations with different types), we will try to find the one with least 
number of type conversions needed."

However,
SELECT SUM(thing=otherthing) FROM table
...fails because thing=otherthing is a BOOL, and there is no rule by which 
BOOL implicitly converts to INT, as it should. INT is higher precision, so this 
conversion should always work. Explicit casting, 
SUM(CAST(thing=otherthing AS INT)), works just fine.

(yes, in this simple case COUNT(1) WHERE thing=otherthing would do the job, but 
it serves to illustrate the bug.)
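
The rule quoted from the Wiki can be pictured with a toy resolver. This Python 
sketch is an assumption-laden model, not Hive's implementation: the conversion 
tables, the `SUM_OVERLOADS` list, and the function names are all hypothetical, 
and only direct (one-step) conversions are considered. It shows how an overload 
search comes up empty when BOOLEAN has no implicit conversion, and succeeds 
once one is added.

```python
# Toy model of "pick the overload needing the fewest conversions".
# The tables below are made up for illustration; they are NOT Hive's rules.
IMPLICIT_WITHOUT_BOOL = {
    "BOOLEAN": set(),
    "INT": {"BIGINT", "DOUBLE"},
    "BIGINT": {"DOUBLE"},
    "DOUBLE": set(),
}
# Same table, plus the implicit BOOLEAN conversions this issue asks for.
IMPLICIT_WITH_BOOL = dict(
    IMPLICIT_WITHOUT_BOOL, BOOLEAN={"INT", "BIGINT", "DOUBLE"}
)

# Hypothetical parameter lists for an aggregate such as SUM.
SUM_OVERLOADS = [("BIGINT",), ("DOUBLE",)]


def conversion_cost(arg_type, param_type, implicit):
    """0 if the types match, 1 if one implicit conversion works, else None."""
    if arg_type == param_type:
        return 0
    if param_type in implicit.get(arg_type, set()):
        return 1
    return None


def pick_overload(arg_types, overloads, implicit):
    """Return the overload with the fewest conversions, or None if none fit."""
    best = None
    for params in overloads:
        if len(params) != len(arg_types):
            continue
        costs = [
            conversion_cost(a, p, implicit) for a, p in zip(arg_types, params)
        ]
        if None in costs:
            continue  # at least one argument cannot be converted
        total = sum(costs)
        if best is None or total < best[0]:
            best = (total, params)
    return None if best is None else best[1]


# No overload fits a BOOLEAN argument under the first table ...
print(pick_overload(("BOOLEAN",), SUM_OVERLOADS, IMPLICIT_WITHOUT_BOOL))
# ... but one does once BOOLEAN may convert implicitly.
print(pick_overload(("BOOLEAN",), SUM_OVERLOADS, IMPLICIT_WITH_BOOL))
```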
