[jira] Commented: (HIVE-50) Tag columns as partitioning columns
[ https://issues.apache.org/jira/browse/HIVE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832576#action_12832576 ]

E. Sammer commented on HIVE-50:
-------------------------------

Rather than having two places to define columns, it would be nicer to specify all columns once and then reference them in the PARTITIONED BY clause. Ex:

    CREATE TABLE events (
      year int,
      month int,
      day int,
      event_type int,
      user_id int
    ) PARTITIONED BY ( year, month, day );

One side effect of this is that the column output order is defined solely by the actual column definition. As of 0.4.1, partition columns always come after normal columns, which is annoying when you expect query output to match the layout of the source files on which the query was run, without needing some explicit external knowledge of the ordering.

In other words, partition columns should only be special to the query parser / optimizer and to the directory structure. Today, partitioning imposes a requirement on the field ordering in files, which violates the notion that there is no Hive file format and means reformatting files needlessly in order to partition them. Partition columns should be able to appear anywhere in the file layout.

Tag columns as partitioning columns
-----------------------------------

Key: HIVE-50
URL: https://issues.apache.org/jira/browse/HIVE-50
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Venky Iyer

    CREATE TABLE tname (INT cname1, INT pcol PARTITIONING)
    COMMENT 'This is a table'
    PARTITIONED BY(dt STRING)
    STORED AS SEQUENCEFILE;

The goal here is to annotate a column as being a partitioning column. Consider pcol in the above example. It is annotated with 'PARTITIONING', which implies that the CREATE TABLE has PARTITIONED BY (dt, pcol) and that every write to this table implicitly does

    INSERT OVERWRITE tname PARTITION (pcol='X') WHERE output.pcol = 'X'

for every distinct value X that pcol takes.

This is ideally an addition on top of the explicit partitioning that is already in the syntax, so that if I said

    INSERT OVERWRITE tname PARTITION (dt='D')

it would still go into the partition (dt='D', pcol='Y') when the value of pcol is Y. It would be up to the user to make sure the cardinality of these columns is reasonable, and that enough data goes into each partition that there is some net benefit (just as it is in the explicit case).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
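
The two ideas in this thread can be illustrated with a small Python sketch. It is not Hive code: the function names `split_columns` and `route_rows` are hypothetical, and the rows are invented sample data. The only Hive behaviour assumed is the `col=value` naming of partition directories. `split_columns` models the comment's proposal (one column list, with partition columns merely marked rather than moved), and `route_rows` models the issue's implicit routing of each row by the value of a tagged column.

```python
from collections import defaultdict


def split_columns(columns, partitioned_by):
    """Model a single column list whose PARTITIONED BY clause only names columns.

    Returns (output_order, data_columns). Output order is the declaration
    order, so partition columns are not pushed to the end.
    """
    names = [name for name, _type in columns]
    missing = [p for p in partitioned_by if p not in names]
    if missing:
        raise ValueError(f"unknown partition columns: {missing}")
    data_columns = [n for n in names if n not in partitioned_by]
    return names, data_columns


def route_rows(rows, explicit, tagged_cols):
    """Group rows by partition path (hypothetical helper).

    `explicit` holds values given in PARTITION (...); `tagged_cols` are the
    columns annotated as partitioning, whose values come from each row.
    """
    partitions = defaultdict(list)
    for row in rows:
        spec = dict(explicit)
        for col in tagged_cols:
            spec[col] = row[col]
        path = "/".join(f"{k}={v}" for k, v in spec.items())
        partitions[path].append(row)
    return dict(partitions)


# Invented sample data, mirroring the examples in the thread.
order, data = split_columns(
    [("year", "int"), ("month", "int"), ("day", "int"),
     ("event_type", "int"), ("user_id", "int")],
    ["year", "month", "day"],
)
routed = route_rows(
    [{"cname1": 1, "pcol": "X"}, {"cname1": 2, "pcol": "Y"}, {"cname1": 3, "pcol": "X"}],
    {"dt": "D"},
    ["pcol"],
)
```

Here a write with an explicit `dt='D'` ends up split across one path per distinct `pcol` value, which is why the issue warns about keeping the cardinality of tagged columns reasonable.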
[jira] Commented: (HIVE-887) Allow SELECT col without a mapreduce job
[ https://issues.apache.org/jira/browse/HIVE-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802311#action_12802311 ]

E. Sammer commented on HIVE-887:
--------------------------------

Ning:

That sounds great. The syntax isn't really important. I was just trying to think of a way of providing a hint to the query execution layer so the user can indicate when they're willing to take the performance hit in favor of a faster initial response. I would think that in many cases you still probably want to execute as a MR job even for small results; this fetch behavior should be a user-requested special case, in my opinion.

Allow SELECT col without a mapreduce job
----------------------------------------

Key: HIVE-887
URL: https://issues.apache.org/jira/browse/HIVE-887
Project: Hadoop Hive
Issue Type: New Feature
Environment: All
Reporter: Eric Sun
Assignee: Ning Zhang

I often find myself needing to take a quick look at a particular column of a Hive table. I usually do this by running

    SELECT * from table LIMIT 20;

from the CLI. This is pretty fast since it doesn't require a mapreduce job. However, it's tough to examine just 1 or 2 columns when the table is very wide. So, I might do

    SELECT col from table LIMIT 20;

but it's much slower since it requires a map-reduce job. It'd be really convenient if a map-reduce job wasn't necessary.

Currently a good workaround is

    hive -e 'select * from table' | cut -f n

but it'd be more convenient if this were built in, since that would remove the need for column counting.
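
The column-counting workaround from the issue description can also be written as a short Python sketch. It assumes the CLI output is tab-separated, one row per line; the function name `select_columns` and the sample rows are invented for illustration, and nothing here calls Hive itself.

```python
def select_columns(lines, indices, sep="\t"):
    """Yield only the requested 1-based column positions from each row.

    Positions beyond the end of a row are skipped rather than raising.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        yield sep.join(fields[i - 1] for i in indices if i <= len(fields))


# Invented sample rows standing in for the output of a SELECT * query.
sample = ["1\talice\t2010-01-01\n", "2\tbob\t2010-01-02\n"]
second_column = list(select_columns(sample, [2]))
first_and_third = list(select_columns(sample, [1, 3]))
```

As with the shell pipeline, the caller still has to know the position of the column, which is the inconvenience the feature request is about.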
[jira] Commented: (HIVE-887) Allow SELECT col without a mapreduce job
[ https://issues.apache.org/jira/browse/HIVE-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802089#action_12802089 ]

E. Sammer commented on HIVE-887:
--------------------------------

I would also like this kind of functionality. I would add WHERE clause support to the request, though, as there are cases where you know a table will be small. What would be really ideal is to be able to define a projected threshold: if the query execution engine thinks there may be more rows than the threshold, it resorts to a MR job, but if fewer, it performs a client-side fetch and filter. The expectation is that GROUP BY, joins, ORDER / SORT / CLUSTER and related operations would always cause a MR job. Ex:

    SELECT a, b FROM t WHERE c = 'foo' FETCH n;

where n is an upper limit on the projected number of rows for which a fetch should be done. If row-count projection is not yet available in Hive (I haven't looked at the internals), maybe FETCH n acts like a fetch + limit operation. Maybe n is simply some global configuration parameter, although that seems too inflexible.

For me, Hive has been excellent for storing raw parsed log data which can be queried into summary tables of around 1 million rows. These summary tables containing aggregations are then queried by a UI for visualization. This fetch functionality would allow the UI load times to go from minutes to seconds and would reduce contention for task slots in a production Hadoop cluster.

Allow SELECT col without a mapreduce job
----------------------------------------

Key: HIVE-887
URL: https://issues.apache.org/jira/browse/HIVE-887
Project: Hadoop Hive
Issue Type: New Feature
Environment: All
Reporter: Eric Sun
Assignee: Ning Zhang
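
The threshold rule proposed in the comment above can be sketched as a small decision function. This is only an illustration of the proposal, not Hive's planner: `choose_execution`, its parameters, and the returned labels are hypothetical names, and treating "under the limit" as "less than or equal to n" is an assumption.

```python
def choose_execution(projected_rows, fetch_limit, needs_mapreduce=False):
    """Pick an execution strategy for a simple SELECT ... WHERE query.

    projected_rows:  estimated row count, or None if no estimate exists.
    fetch_limit:     the n from the proposed FETCH n, or None if the user
                     did not ask for fetch behavior.
    needs_mapreduce: True for GROUP BY, joins, ORDER / SORT / CLUSTER.
    """
    if needs_mapreduce:
        return "mapreduce"            # these operations always use a MR job
    if fetch_limit is None:
        return "mapreduce"            # fetch is an opt-in special case
    if projected_rows is None:
        return "fetch_with_limit"     # no estimate: behave like fetch + limit n
    if projected_rows <= fetch_limit:
        return "fetch"                # small enough for client-side fetch and filter
    return "mapreduce"                # too many projected rows


plan_small = choose_execution(projected_rows=500, fetch_limit=1000)
plan_large = choose_execution(projected_rows=5_000_000, fetch_limit=1000)
plan_unknown = choose_execution(projected_rows=None, fetch_limit=1000)
plan_grouped = choose_execution(projected_rows=10, fetch_limit=1000, needs_mapreduce=True)
```

Keeping the limit as a per-query argument rather than a global setting matches the comment's concern that a single configuration parameter would be too inflexible.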