[jira] Commented: (HIVE-50) Tag columns as partitioning columns

2010-02-11 Thread E. Sammer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832576#action_12832576
 ] 

E. Sammer commented on HIVE-50:
---

Rather than having two places to define columns, it seems like it would be 
nicer to specify all columns once and then reference them in the partitioned by 
clause.

Ex:

CREATE TABLE events ( year int, month int, day int, event_type int, user_id int )
PARTITIONED BY ( year, month, day );

One side effect of this is that the column output order is defined solely by
the actual column definition. As of 0.4.1, partitioned columns always come
after normal columns, which is annoying when you want query output to match the
layout of the source files the query ran over without needing explicit external
knowledge of the ordering. In other words, partitioned columns should be
special only to the query parser / optimizer and the directory structure.
Today, partitioning imposes a requirement on field ordering in files, which
violates the notion that there is no Hive file format and means needlessly
reformatting files in order to partition them. Partitioned columns should be
able to appear anywhere in the file layout.
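
To make the separation concrete, here is a minimal sketch in Python (not Hive
code; the function names and the row-as-dict representation are assumptions
made for illustration). The column list is defined once, the partition columns
are referenced by name, and only the directory path depends on which columns
are partitioned. The key=value path form mirrors Hive's partition directories.

```python
# Hypothetical sketch: these helpers are illustrative, not part of Hive.

def partition_path(row, partition_columns):
    """Build a key=value directory path from the named partition columns."""
    return "/".join(f"{col}={row[col]}" for col in partition_columns)


def output_row(row, columns):
    """Emit values strictly in column-definition order."""
    return [row[col] for col in columns]


# Columns are defined once; partition columns only reference them by name.
columns = ["year", "month", "day", "event_type", "user_id"]
partition_columns = ["year", "month", "day"]

row = {"year": 2010, "month": 2, "day": 11, "event_type": 3, "user_id": 42}
print(partition_path(row, partition_columns))
print(output_row(row, columns))
```

Reordering the entries in columns changes the output order but leaves the
partition path untouched, which is the behavior argued for above.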

 Tag columns as partitioning columns
 ---

 Key: HIVE-50
 URL: https://issues.apache.org/jira/browse/HIVE-50
 Project: Hadoop Hive
 Issue Type: Bug
 Components: Query Processor
 Reporter: Venky Iyer

 CREATE TABLE tname (cname1 INT, pcol INT PARTITIONING)
 COMMENT 'This is a table'
 PARTITIONED BY (dt STRING)
 STORED AS SEQUENCEFILE;

 The goal here is to annotate a column as being a partitioning column.
 Consider pcol in the above example. It is annotated with 'PARTITIONING',
 which implies that the create table has

 PARTITIONED BY (dt, pcol)

 and that every write to this table implicitly has

 INSERT OVERWRITE TABLE tname PARTITION (pcol='X')
 WHERE output.pcol = 'X'

 for every distinct value X that pcol takes.

 This is ideally an addition on top of the explicit partitioning that is
 already in the syntax, so that if I said

 INSERT OVERWRITE TABLE tname PARTITION (dt='D')

 it would still go into the partition (dt='D', pcol='Y') when the value of
 pcol is Y.

 It would be up to the user to make sure the cardinality of these columns is
 reasonable, and that enough data goes into each partition that there is some
 net benefit (just as it is in the explicit case).
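
 The implicit routing described above can be sketched in Python. This is only
 an illustration under assumptions: route_rows and the tuple-based partition
 spec are hypothetical, not Hive internals. Each row lands in the partition
 formed by the explicit values plus the row's own value of the annotated column.

```python
from collections import defaultdict


def route_rows(rows, explicit_partition, annotated_column):
    """Group rows by partition spec (hypothetical helper, not a Hive API).

    The spec is the explicit partition values plus each row's own value
    for the column annotated as PARTITIONING.
    """
    partitions = defaultdict(list)
    for row in rows:
        spec = dict(explicit_partition)
        spec[annotated_column] = row[annotated_column]
        partitions[tuple(spec.items())].append(row)
    return dict(partitions)


rows = [
    {"cname1": 1, "pcol": "X"},
    {"cname1": 2, "pcol": "Y"},
    {"cname1": 3, "pcol": "X"},
]

# Equivalent in spirit to writing with PARTITION (dt='D') only.
result = route_rows(rows, {"dt": "D"}, "pcol")
for spec, members in result.items():
    print(dict(spec), len(members))
```

 One partition is produced per distinct value of pcol, which is why the
 cardinality of the annotated column matters.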

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-887) Allow SELECT col without a mapreduce job

2010-01-19 Thread E. Sammer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802311#action_12802311
 ] 

E. Sammer commented on HIVE-887:


Ning:

That sounds great. The syntax isn't really important. I was just trying to 
think of a way of providing a hint to the query execution layer so the user can 
indicate when they're willing to take the performance hit in favor of a faster 
initial response. I would think that in many cases you still probably want to 
exec as a MR job even for small results; this fetch behavior should be a user 
requested special case, in my opinion.

 Allow SELECT col without a mapreduce job
 --

 Key: HIVE-887
 URL: https://issues.apache.org/jira/browse/HIVE-887
 Project: Hadoop Hive
 Issue Type: New Feature
 Environment: All
 Reporter: Eric Sun
 Assignee: Ning Zhang

 I often find myself needing to take a quick look at a particular column of a 
 Hive table.
 I usually do this by doing a 
 SELECT * from table LIMIT 20;
 from the CLI.  Doing this is pretty fast since it doesn't require a mapreduce 
 job.  However, it's tough to examine just 1 or 2 columns when the table is 
 very wide.
 So, I might do
 SELECT col from table LIMIT 20;
 but it's much slower since it requires a map-reduce.  It'd be really 
 convenient if a map-reduce wasn't necessary.
 Currently a good work around is to do
 hive -e 'select * from table' | cut -f n
 but it'd be more convenient if it were built in since it alleviates the need 
 for column counting.




[jira] Commented: (HIVE-887) Allow SELECT col without a mapreduce job

2010-01-18 Thread E. Sammer (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802089#action_12802089
 ] 

E. Sammer commented on HIVE-887:


I would also like this kind of functionality. I would add WHERE clause support
to the request, though, as there are cases where you know a table will be small.
What would be really ideal is to be able to define a projected-row threshold:
if the query execution engine thinks there may be more rows than the threshold,
it resorts to a MR job; if fewer, it performs a client-side fetch and filter.
The expectation is that GROUP BY, joins, ORDER / SORT / CLUSTER and related
would always cause a MR job.

Ex:

SELECT a, b FROM t WHERE c = 'foo' FETCH n;

where n is the upper limit on the projected number of rows for which a fetch
should be done. If row-count projection is not yet available in Hive (I haven't
looked at the internals), maybe FETCH n acts like a fetch + limit operation.
Maybe n is simply some global configuration parameter, although that seems too
inflexible.
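
As a rough sketch of the decision being proposed (Python, purely illustrative:
choose_execution, its arguments, and the returned labels are assumptions, and
treating the limit as inclusive is my own choice), the logic might look like
this:

```python
# Operations expected to always require a MapReduce job in this proposal.
MR_ONLY_OPERATIONS = {"GROUP BY", "JOIN", "ORDER BY", "SORT BY", "CLUSTER BY"}


def choose_execution(operations, projected_rows, fetch_limit):
    """Pick an execution strategy for a query (hypothetical, not Hive code).

    operations     -- names of the operations the query uses
    projected_rows -- estimated row count, or None if no estimate exists
    fetch_limit    -- the n in FETCH n
    """
    if MR_ONLY_OPERATIONS & set(operations):
        return "mapreduce"
    if projected_rows is None:
        # No projection available: behave like a fetch capped at n rows.
        return "fetch+limit"
    # Assumption: a projection equal to the limit still qualifies for fetch.
    return "fetch" if projected_rows <= fetch_limit else "mapreduce"


print(choose_execution({"WHERE"}, 500, 1000))
print(choose_execution({"WHERE"}, 5_000_000, 1000))
print(choose_execution({"WHERE", "GROUP BY"}, 10, 1000))
print(choose_execution({"WHERE"}, None, 1000))
```

Passing fetch_limit per query corresponds to the FETCH n form; reading it from
a global setting instead would be the less flexible configuration-parameter
variant.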

For me, Hive has been excellent for storing raw parsed log data which can be 
queried into summary tables of around 1 million rows. These summary tables 
containing aggregations are then queried by a UI for visualization. This 
fetch functionality would allow for the UI load times to go from minutes to 
seconds and reduce contention for task slots in a production Hadoop cluster.
