[jira] Created: (HIVE-352) Make Hive support column based storage
Make Hive support column based storage
---------------------------------------
Key: HIVE-352
URL: https://issues.apache.org/jira/browse/HIVE-352
Project: Hadoop Hive
Issue Type: New Feature
Reporter: he yongqiang

Column-based storage has been proven a better storage layout for OLAP. Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage. Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive. Any thoughts?
Hudson build is back to normal: Hive-trunk-h0.17 #34
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/34/changes
Can I specify a query in a test to see execution trace?
Hello,

Is there a simple test in which I can specify a query and see its execution trace under Eclipse's debug mode? Is there any test that interactively prompts for a query?

Thanks,
shyam_sar...@yahoo.com
[jira] Commented: (HIVE-352) Make Hive support column based storage
[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716 ]

Joydeep Sen Sarma commented on HIVE-352:
-----------------------------------------

Thanks for taking this on - this could be pretty awesome. Traditionally, the arguments for columnar storage have been limited scan bandwidth and compression. In practice, we see that scan bandwidth has two components:

1. disk/file-system bandwidth to read data
2. compute cost to scan data

Most columnar stores optimize for both (especially because in shared-disk architectures #1 is at a premium). However, our limited experience suggests that in Hadoop #1 is almost infinite, while #2 can still be a bottleneck. (It is possible that this observation holds because of high Hadoop/Java compute overheads - regardless, this seems to be the reality.)

Given this, I like the idea of a scheme where columns are stored as independent streams inside a block-oriented file format (each file block contains a set of rows, but the organization inside a block is by column). This does not optimize for #1, but it does optimize for #2 (potentially in conjunction with Hive's interfaces for getting one column at a time from the IO libraries). It also gives us nearly equivalent compression. (The alternative scheme of having different file(s) per column is also complicated by the fact that locality is almost impossible to ensure, and there is no reasonable way of asking HDFS to colocate different file segments in the near future.)

I would love to understand how you are planning to approach this. Will we still use SequenceFiles as a container, or should we ditch them? (SequenceFile wasn't a great fit for Hive, given that we don't use the key field, but it was the best thing we could find.) We have seen that having a number of open codecs can hurt memory usage - that's one open question for me: can we actually afford to open N concurrent compressed streams (assuming each column is stored compressed separately)?

It also seems that one could define a ColumnarInputFormat/OutputFormat as a generic API with different implementations and different pluggable containers underneath, supporting either a file-per-column or a columnar-within-a-block scheme. In that sense we could build something more generic for Hadoop (and then just make sure that Hive's lazy SerDe uses the columnar API for data access, instead of the row-based API exposed by the current InputFormat).
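To make the "columnar inside a block" idea above concrete, here is a minimal sketch of a writer that buffers one row group and lays each column out as a contiguous stream inside the block. This is purely illustrative - all class and method names are hypothetical, and it is not Hive or Hadoop code:

{code}
// Illustrative only: buffer one row group; on flush, write a small
// directory of per-column byte lengths, then each column's bytes back
// to back, so a reader can seek straight to the one column it needs.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

public class ColumnarBlockWriter {
  private final ByteArrayOutputStream[] columns;

  public ColumnarBlockWriter(int numColumns) {
    columns = new ByteArrayOutputStream[numColumns];
    for (int i = 0; i < numColumns; i++) {
      columns[i] = new ByteArrayOutputStream();
    }
  }

  // Buffer one row by appending each field to its column's stream,
  // length-prefixed so values can be decoded later.
  public void appendRow(List<byte[]> fields) throws IOException {
    for (int i = 0; i < fields.size(); i++) {
      DataOutputStream col = new DataOutputStream(columns[i]);
      col.writeInt(fields.get(i).length);
      col.write(fields.get(i));
    }
  }

  // Flush the row group: column count, per-column lengths, then the
  // column streams themselves.
  public void flushBlock(DataOutputStream out) throws IOException {
    out.writeInt(columns.length);
    for (ByteArrayOutputStream col : columns) {
      out.writeInt(col.size());
    }
    for (ByteArrayOutputStream col : columns) {
      col.writeTo(out);
      col.reset();
    }
  }
}
{code}

Because each column's bytes are contiguous, each could also be run through its own codec before flushing - which is exactly where the question above about N concurrently open compressed streams comes in.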
[jira] Commented: (HIVE-78) Authentication infrastructure for Hive
[ https://issues.apache.org/jira/browse/HIVE-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682719#action_12682719 ]

Edward Capriolo commented on HIVE-78:
--------------------------------------

We also have to look at this on the file-system level. For example, files in my warehouse are owned by the user who created the table:

{quote}
/user/hive/warehouse/edward dir 2008-10-30 17:13 rwxr-xr-x edward supergroup
{quote}

Regardless of what permissions are granted in the metastore (via this jira), the Hadoop ACL governs what a user can do to that file. This is not an issue in MySQL: in a typical MySQL deployment, all of the data files are owned by a single mysql user. I do not see a clear-cut solution for this.

In one scenario, we make sure all the files in the warehouse are readable and writable by all, or owned by a specific user, and a component like HiveServer, the CLI, or HWI decides whether the user's action succeeds based on the metadata. The other option is that an operation like 'GRANT SELECT' would have to physically modify the Hadoop ACL/owner - but that method will not give us the fine-grained control we desire.

Authentication infrastructure for Hive
---------------------------------------
Key: HIVE-78
URL: https://issues.apache.org/jira/browse/HIVE-78
Project: Hadoop Hive
Issue Type: New Feature
Components: Server Infrastructure
Reporter: Ashish Thusoo
Assignee: Edward Capriolo

Allow Hive to integrate with existing user repositories for authentication and authorization information.
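As a sketch of the first option (a front end consulting HDFS file metadata before permitting an action), assuming the standard Hadoop FileSystem API - the path, class name, and decision logic here are illustrative, and the group-membership check is omitted:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class WarehouseAclCheck {
  // Would the HDFS permission bits let this user read the table's files?
  public static boolean canRead(String user, Path table, Configuration conf)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    FileStatus st = fs.getFileStatus(table);
    if (st.getOwner().equals(user)) {
      // Owner bit: e.g. rwxr-xr-x grants the owner read access.
      return st.getPermission().getUserAction().implies(FsAction.READ);
    }
    // Fall through to the "other" bits (group check omitted in this sketch).
    return st.getPermission().getOtherAction().implies(FsAction.READ);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path("/user/hive/warehouse/edward");
    System.out.println(canRead("edward", p, conf));
  }
}
{code}

The catch, as the comment notes, is that this only mirrors the coarse HDFS bits; any finer-grained grants in the metastore would still have to be enforced separately by the front end.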
[jira] Commented: (HIVE-352) Make Hive support column based storage
[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740 ]

he yongqiang commented on HIVE-352:
------------------------------------

Thanks, Joydeep Sen Sarma. Your feedback is really important.

1. Storage scheme: block-wise column store vs. one file per column.
Our current implementation stores each column in its own file. The most annoying part for us, just as you said, is that currently - and even in the near future - HDFS does not support colocating the different file segments of columns in the same table. So some operations need to fetch data from a new file (like a map-side hash join, or a join with CompositeInputFormat), or need an extra map-reduce job to merge data together; other operations work fine as-is. I think block-wise column storage is a good idea, and I will try to implement it soon. With different columns colocated in a single block, some operations would not need a reduce phase (which is really time-consuming).

2. Compression.
With different columns in different files, lightweight compression schemes such as RLE, dictionary, and bit-vector encoding can be used. One benefit of these lightweight compression algorithms is that some operations do not need to decompress the data at all. If we implement block-wise column storage, should we let the user specify the lightweight compression algorithm for each column, or should we choose one (like RLE) internally when the data clusters well? Since dictionary and bit-vector encodings should also be supported, should columns using those algorithms also be placed in the block-wise columnar file? I think placing such columns in separate files would be easier to handle, but I do not know whether that fits into Hive - I am new to Hive.

{quote} having a number of open codecs can hurt in memory usage {quote}

Currently I cannot think of a solution to avoid this for the file-per-column store.

3. File format.
Yes, I think we need to add new file formats and their corresponding InputFormats. Currently we have implemented VFile (Value File - we do not need to store a key part) and BitMapFile. We have not implemented a DictionaryFile; instead, we use a header file for VFile to store dictionary entries. The header file is unnecessary for some columns and mandatory for others. I think refactoring the file formats should be the starting point for this issue.

Thanks again.
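As an illustration of the lightweight-compression point above - that some operations need no decompression - here is a self-contained sketch (not the actual VFile code) of run-length encoding a clustered column and answering a count directly on the runs:

{code}
import java.util.ArrayList;
import java.util.List;

public class RleColumn {
  // Each run is (value, repeat count).
  static class Run {
    final int value;
    final int count;
    Run(int value, int count) { this.value = value; this.count = count; }
  }

  static List<Run> encode(int[] column) {
    List<Run> runs = new ArrayList<Run>();
    int i = 0;
    while (i < column.length) {
      int j = i;
      while (j < column.length && column[j] == column[i]) j++;
      runs.add(new Run(column[i], j - i));
      i = j;
    }
    return runs;
  }

  // COUNT(*) WHERE col = v, evaluated on the compressed runs themselves.
  static long countEquals(List<Run> runs, int v) {
    long n = 0;
    for (Run r : runs) {
      if (r.value == v) n += r.count;
    }
    return n;
  }

  public static void main(String[] args) {
    int[] col = {5, 5, 5, 7, 7, 5, 5};
    List<Run> runs = encode(col);              // [(5,3), (7,2), (5,2)]
    System.out.println(countEquals(runs, 5));  // prints 5
  }
}
{code}

The same shape applies to dictionary and bit-vector encodings: predicates can often be evaluated against the encoded representation, which targets the scan-compute cost discussed earlier in the thread.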
[jira] Updated: (HIVE-347) [hive] lot of mappers due to a user error while specifying the partitioning column
[ https://issues.apache.org/jira/browse/HIVE-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-347:
-----------------------------
Resolution: Fixed
Fix Version/s: 0.3.0
Status: Resolved (was: Patch Available)

Committed.

[hive] lot of mappers due to a user error while specifying the partitioning column
------------------------------------------------------------------------------------
Key: HIVE-347
URL: https://issues.apache.org/jira/browse/HIVE-347
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Fix For: 0.3.0
Attachments: hive.347.1.patch, hive.347.2.patch, hive.347.3.patch

A common scenario: the table is partitioned on a 'ds' column of type 'string' in the format 'yyyy-mm-dd'. However, if the user forgets to add quotes while specifying the query:

select ... from T where ds = 2009-02-02

2009-02-02 is a valid integer expression (it parses as the subtraction 2009 - 02 - 02, which evaluates to 2005). So partition pruning marks all partitions unknown, since converting a partition value such as '2009-02-02' to double yields null. If all partitions are unknown, in strict mode we should throw an error.
[jira] Commented: (HIVE-350) [Hive] wrong order in explain plan
[ https://issues.apache.org/jira/browse/HIVE-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682816#action_12682816 ]

Namit Jain commented on HIVE-350:
----------------------------------

The explain plan is not wrong per se - the order is random, and eventually the select corrects the order. But it was very difficult to understand.

[Hive] wrong order in explain plan
-----------------------------------
Key: HIVE-350
URL: https://issues.apache.org/jira/browse/HIVE-350
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Attachments: hive.350.1.patch

In the case of multiple aggregations, the order of aggregations in the explain plan might be wrong, since QBParseInfo maintains the information in a HashMap, which does not guarantee that entries are returned in insertion order.
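The root cause is plain JDK behavior, shown by this self-contained snippet - HashMap iteration order is unspecified, while LinkedHashMap preserves insertion order (the keys below are made-up aggregation expressions, not the actual Hive fix):

{code}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MapOrder {
  public static void main(String[] args) {
    Map<String, String> hash = new HashMap<String, String>();
    Map<String, String> linked = new LinkedHashMap<String, String>();
    String[] aggs = {"sum(a)", "count(b)", "avg(c)", "min(d)"};
    for (String a : aggs) {
      hash.put(a, a);
      linked.put(a, a);
    }
    // Order depends on hash codes and capacity - effectively arbitrary.
    System.out.println(hash.keySet());
    // Always [sum(a), count(b), avg(c), min(d)].
    System.out.println(linked.keySet());
  }
}
{code}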
[jira] Assigned: (HIVE-333) Add TFileTransport deserializer
[ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reassigned HIVE-333:
---------------------------------------
Assignee: Joydeep Sen Sarma

Add TFileTransport deserializer
--------------------------------
Key: HIVE-333
URL: https://issues.apache.org/jira/browse/HIVE-333
Project: Hadoop Hive
Issue Type: New Feature
Components: Serializers/Deserializers
Environment: Linux
Reporter: Steve Corona
Assignee: Joydeep Sen Sarma

I've been googling around all night and haven't really found what I am looking for. Basically, I want to transfer some data from my web servers to Hive in a format that's a little more verbose than plain CSV files. It seems like JSON or Thrift would be perfect for this. I am planning on sending this serialized JSON or Thrift data through Scribe and loading it into Hive. I just can't figure out how to tell Hive that the input data is a bunch of serialized Thrift records (all of the records are of the same struct type) in a TFileTransport. Hopefully this makes sense.

Reply from Joydeep Sen Sarma (jssa...@facebook.com):

Unfortunately, the open-source code base does not have the loaders we run to convert Thrift records in a TFileTransport into a SequenceFile that Hadoop/Hive can work with. One option is that we add this to the Hive code base (should be straightforward). No process required - please file a jira. I will try to upload a patch this weekend (just cut-and-paste for the most part). Would appreciate some help in finessing it out (the internal code is hardwired to some assumptions, etc.).
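A rough sketch of the conversion direction being discussed: write already-serialized Thrift records into a SequenceFile with a null key (since Hive ignores the key field). The TFileTransport reading side is deliberately elided here, and the class and method names are made up - this is not the loader mentioned above:

{code}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class ThriftToSequenceFile {
  // records: raw bytes of each serialized Thrift struct, however obtained.
  public static void toSequenceFile(Iterator<byte[]> records, Path out,
                                    Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, NullWritable.class, BytesWritable.class);
    try {
      while (records.hasNext()) {
        // Hive does not use the key, so NullWritable stands in for it.
        writer.append(NullWritable.get(), new BytesWritable(records.next()));
      }
    } finally {
      writer.close();
    }
  }
}
{code}

A Thrift-aware SerDe on the Hive side would then deserialize each BytesWritable value back into the struct type at read time.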
JIRA_hive.350.1.patch_UNIT_TEST_SUCCEEDED
SUCCESS: BUILD AND UNIT TEST using PATCH hive.350.1.patch PASSED!!
[jira] Created: (HIVE-353) Comments can't have semi-colons
Comments can't have semi-colons
--------------------------------
Key: HIVE-353
URL: https://issues.apache.org/jira/browse/HIVE-353
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: S. Alex Smith
Priority: Minor

hive> CREATE TABLE tmp_foo(foo DOUBLE COMMENT ';');
FAILED: Parse Error: line 2:7 mismatched input 'TABLE' expecting TEMPORARY in create function statement

hive> CREATE TABLE tmp_foo(foo DOUBLE);
OK
JIRA_patch-251_1.txt_UNIT_TEST_FAILED
ERROR: UNIT TEST using PATCH patch-251_1.txt FAILED!!

[junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED

BUILD FAILED
[jira] Commented: (HIVE-251) Failures in Transform don't stop the job
[ https://issues.apache.org/jira/browse/HIVE-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682866#action_12682866 ]

Namit Jain commented on HIVE-251:
----------------------------------

+1, looks good. Before committing, can you remove the extra commented-out line in the reduce script and add a comment explaining what it is doing?

Failures in Transform don't stop the job
-----------------------------------------
Key: HIVE-251
URL: https://issues.apache.org/jira/browse/HIVE-251
Project: Hadoop Hive
Issue Type: Bug
Components: Serializers/Deserializers
Affects Versions: 0.2.0
Reporter: S. Alex Smith
Assignee: Ashish Thusoo
Priority: Blocker
Fix For: 0.3.0
Attachments: patch-251.txt, patch-251_1.txt

If the program executed via a SELECT TRANSFORM() USING 'foo' exits with a non-zero exit status, Hive proceeds as if nothing bad happened. The main way the user learns that something went wrong is by checking the logs (probably because he got no output). This is doubly bad if the program fails only part of the time (say, on certain inputs), since the job will still produce output and the problem will likely go undetected.
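The shape of the fix can be illustrated with a standalone snippet that waits for a child process and surfaces a non-zero exit status as a failure instead of silently continuing - a sketch of the idea only, not the contents of patch-251:

{code}
import java.io.IOException;

public class ScriptExitCheck {
  public static void run(String... cmd) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    int rc = p.waitFor();
    if (rc != 0) {
      // Propagate the failure so the enclosing job is marked failed,
      // rather than producing partial output with no error.
      throw new IOException("transform script exited with status " + rc);
    }
  }

  public static void main(String[] args) throws Exception {
    run("/bin/false");  // throws: exit status 1
  }
}
{code}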
[jira] Created: (HIVE-354) [hive] udf needed for getting length of a string
[hive] udf needed for getting length of a string
-------------------------------------------------
Key: HIVE-354
URL: https://issues.apache.org/jira/browse/HIVE-354
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
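For illustration, a string-length UDF in Hive's classic UDF style (extend UDF and provide an evaluate method) might look like the following - the class name is hypothetical, and this is not the committed implementation:

{code}
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class UDFStringLength extends UDF {
  // Reused result object, following the usual Writable-reuse pattern.
  private final IntWritable result = new IntWritable();

  public IntWritable evaluate(Text s) {
    if (s == null) {
      return null;  // SQL semantics: length(NULL) is NULL
    }
    result.set(s.toString().length());
    return result;
  }
}
{code}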
[jira] Updated: (HIVE-251) Failures in Transform don't stop the job
[ https://issues.apache.org/jira/browse/HIVE-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-251:
--------------------------------
Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Committed. Also made the changes suggested by Namit.