[jira] Created: (HIVE-352) Make Hive support column based storage

2009-03-17 Thread he yongqiang (JIRA)
Make Hive support column based storage
--

 Key: HIVE-352
 URL: https://issues.apache.org/jira/browse/HIVE-352
 Project: Hadoop Hive
  Issue Type: New Feature
Reporter: he yongqiang


Column-based storage has been proven to be a better storage layout for OLAP. 
Hive does a great job on raw row-oriented storage. In this issue, we will 
enhance Hive to support column-based storage. 
Actually, we have done some work on column-based storage on top of HDFS; I 
think it will need some review and refactoring to port it to Hive.


Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Hudson build is back to normal: Hive-trunk-h0.17 #34

2009-03-17 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/34/changes




Can I specify a query in a test to see execution trace?

2009-03-17 Thread Shyam Sarkar

Hello,

Is there a simple test where I can specify a query and see the execution trace 
in Eclipse Debug mode? Is there any test that interactively asks for a query?


Thanks,
shyam_sar...@yahoo.com


  


[jira] Commented: (HIVE-352) Make Hive support column based storage

2009-03-17 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716
 ] 

Joydeep Sen Sarma commented on HIVE-352:


Thanks for taking this on. This could be pretty awesome.

Traditionally the arguments for columnar storage have been limited 'scan 
bandwidth' and compression. In practice, we see that scan bandwidth has two 
components:
1. disk/file-system bandwidth to read data
2. compute cost to scan data

Most columnar stores optimize for both (especially because in shared-disk 
architectures #1 is at a premium). However, our limited experience is that in 
Hadoop #1 is almost infinite; #2 can still be a bottleneck though. (It is 
possible that this observation holds because of high Hadoop/Java compute 
overheads; regardless, this seems to be the reality.)

Given this, I like the idea of a scheme where columns are stored as 
independent streams inside a block-oriented file format (each file block 
contains a set of rows, but the organization inside the blocks is by column). 
This does not optimize for #1, but it does optimize for #2 (potentially in 
conjunction with Hive's interfaces to get one column at a time from the IO 
libraries). It also gives us nearly equivalent compression.
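
As a rough illustration of the block-oriented idea, here is a minimal writer 
sketch (hypothetical code, not part of Hive; the class name and the simple 
length-prefixed layout are invented for illustration): each flushed block 
holds a complete set of rows, but within the block every column's values are 
contiguous, so a reader interested in one column scans one region per block.

{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ColumnGroupedBlockWriter {
  private final DataOutputStream out;
  private final int numColumns;
  private final int rowsPerBlock;
  private final List<byte[][]> rows = new ArrayList<byte[][]>();

  public ColumnGroupedBlockWriter(DataOutputStream out, int numColumns,
      int rowsPerBlock) {
    this.out = out;
    this.numColumns = numColumns;
    this.rowsPerBlock = rowsPerBlock;
  }

  // row[c] holds the serialized value of column c for this row.
  public void append(byte[][] row) throws IOException {
    rows.add(row);
    if (rows.size() >= rowsPerBlock) {
      flushBlock();
    }
  }

  public void flushBlock() throws IOException {
    if (rows.isEmpty()) {
      return;
    }
    out.writeInt(rows.size());               // number of rows in this block
    for (int c = 0; c < numColumns; c++) {   // one contiguous stream per column
      for (byte[][] row : rows) {
        out.writeInt(row[c].length);         // length-prefixed value
        out.write(row[c]);
      }
    }
    rows.clear();
  }
}
{code}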

(The alternative scheme of having different file(s) per column is also 
complicated by the fact that locality is almost impossible to ensure, and 
there is no reasonable way of asking HDFS to colocate different file segments 
in the near future.)

--

I would love to understand how you are planning to approach this. Will we 
still use SequenceFiles as a container, or should we ditch them? (They 
weren't a great fit for Hive, given that we don't use the key field, but they 
were the best thing we could find.) We have seen that having a number of open 
codecs can hurt memory usage; that's one open question for me: can we 
actually afford to open N concurrent compressed streams (assuming each column 
is stored compressed separately)?

It also seems that one could define a ColumnarInputFormat/OutputFormat as a 
generic API with different implementations and different pluggable containers 
underneath, using a scheme of either file-per-column or columnar-in-a-block. 
In that sense we could build something more generic for Hadoop (and then just 
make sure that Hive's lazy SerDe uses the columnar API for data access, 
instead of the row-based API exposed by the current InputFormat).
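
As a very rough sketch of what such a generic columnar reader API could look 
like (all names here are invented for illustration; this is not an existing 
Hadoop or Hive interface):

{code}
import java.io.IOException;

// Hypothetical interface sketch: a reader that exposes data column by
// column rather than row by row. Both a block-organized container and a
// file-per-column container could implement it underneath.
public interface ColumnarRecordReader {
  // Restrict IO (and decompression) to the listed column indices.
  void selectColumns(int[] columnIds) throws IOException;

  // Advance to the next row group (e.g. the next block); false at EOF.
  boolean nextRowGroup() throws IOException;

  // Serialized values of one selected column for the current row group.
  byte[][] getColumn(int columnId) throws IOException;

  void close() throws IOException;
}
{code}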

 Make Hive support column based storage
 --

 Key: HIVE-352
 URL: https://issues.apache.org/jira/browse/HIVE-352
 Project: Hadoop Hive
  Issue Type: New Feature
Reporter: he yongqiang

 Column-based storage has been proven to be a better storage layout for 
 OLAP. Hive does a great job on raw row-oriented storage. In this issue, we 
 will enhance Hive to support column-based storage. 
 Actually, we have done some work on column-based storage on top of HDFS; I 
 think it will need some review and refactoring to port it to Hive.
 Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-78) Authentication infrastructure for Hive

2009-03-17 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682719#action_12682719
 ] 

Edward Capriolo commented on HIVE-78:
-

We also have to look at this at the filesystem level. For example, files in 
my warehouse are owned by the user who created the table.

{quote}
/user/hive/warehouse/edward  dir   2008-10-30 17:13
rwxr-xr-x   edward supergroup
{quote}

Regardless of what permissions are granted in the metastore (via this jira), 
the Hadoop ACL governs what a user can do to that file. 

This is not an issue in MySQL. In a typical MySQL deployment all of the data 
files are owned by the mysql user. 

I do not see a clear-cut solution for this. 

In one scenario we make sure all the files in the warehouse are readable and 
writable by all, or owned by a specific user. A component like HiveServer, 
the CLI, or HWI would then decide whether the user's action should succeed 
based on the metadata.

The other option is that an operation like 'GRANT SELECT' would have to 
physically modify the Hadoop ACL/owner. This method will not get us the 
fine-grained control we desire.
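
For illustration, a minimal sketch (hypothetical class; the decision logic is 
simplified and ignores group membership) of how a component like HiveServer 
could check whether a user's write can physically succeed against the HDFS 
owner and permission bits, independent of any metastore grants:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class WarehousePermissionCheck {
  // Returns true only if the HDFS permission bits would let 'user' write
  // to the table directory. A metastore GRANT alone cannot make a write
  // succeed if this returns false, since HDFS enforces its own ACL.
  // (Group membership is ignored in this sketch for brevity.)
  public static boolean canPhysicallyWrite(String user, String tableDir)
      throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path(tableDir)); // e.g. /user/hive/warehouse/edward
    FsPermission perm = st.getPermission();
    if (st.getOwner().equals(user)) {
      return perm.getUserAction().implies(FsAction.WRITE);
    }
    return perm.getOtherAction().implies(FsAction.WRITE);
  }
}
{code}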
 

 Authentication infrastructure for Hive
 --

 Key: HIVE-78
 URL: https://issues.apache.org/jira/browse/HIVE-78
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Server Infrastructure
Reporter: Ashish Thusoo
Assignee: Edward Capriolo

 Allow Hive to integrate with existing user repositories for authentication 
 and authorization information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-352) Make Hive support column based storage

2009-03-17 Thread he yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740
 ] 

he yongqiang commented on HIVE-352:
---

Thanks, Joydeep Sen Sarma. Your feedback is really important.

1. Storage scheme: block-wise column store or one file per column.
Our current implementation stores each column in one file. The most annoying 
part for us, just as you said, is that currently, and even in the near 
future, HDFS does not support colocating the different file segments for 
columns of the same table. So some operations need to fetch data from a new 
file (like a map-side hash join, or a join with CompositeInputFormat) or need 
to add a new map-reduce job to merge data together. Other operations, though, 
work well with this layout.
I think block-wise column storage is a good idea, and I will try to implement 
it in the near future. With different columns colocated in a single block, 
some operations do not need a reduce phase (which is really time-consuming).

2. Compression
With different columns in different files, some lightweight compression 
schemes, such as RLE, dictionary, and bit-vector encoding, can be used. One 
benefit of these lightweight compression algorithms is that some operations 
do not need to decompress the data (see the sketch below).
If we implement the block-wise column storage, should we let the user specify 
the lightweight compression algorithm for each column, or should we choose 
one (like RLE) internally if the data clusters well? Since dictionary and 
bit-vector encoding should also be supported, should the columns with these 
compression algorithms also be placed in the block-wise columnar file? I 
think placing these columns in separate files could be handled more easily, 
but I do not know whether that fits into Hive; I am new to Hive.
{quote}
having a number of open codecs can hurt in memory usage
{quote}
Currently I cannot think of a way to avoid this for the column-per-file store.
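
To make the "operate without decompressing" point concrete, here is a minimal 
run-length encoding sketch (illustrative only, not Hive code; the class and 
method names are invented): a COUNT of a given value can be answered directly 
from the (value, run length) pairs, without decoding the column.

{code}
import java.util.ArrayList;
import java.util.List;

public class RunLengthEncoding {
  // One run: 'count' consecutive occurrences of 'value'.
  public static class Run {
    public final int value;
    public final int count;
    public Run(int value, int count) { this.value = value; this.count = count; }
  }

  public static List<Run> encode(int[] column) {
    List<Run> runs = new ArrayList<Run>();
    int i = 0;
    while (i < column.length) {
      int j = i;
      while (j < column.length && column[j] == column[i]) {
        j++;
      }
      runs.add(new Run(column[i], j - i));
      i = j;
    }
    return runs;
  }

  // Operating on compressed data: count occurrences of 'v' without
  // materializing the decoded column.
  public static long countValue(List<Run> runs, int v) {
    long total = 0;
    for (Run r : runs) {
      if (r.value == v) {
        total += r.count;
      }
    }
    return total;
  }
}
{code}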

3. File format
Yes, I think we need to add new file formats and their corresponding 
InputFormats. Currently, we have implemented the VFile (Value File; we do not 
need to store a key part) and the BitMapFile. We have not implemented a 
DictionaryFile; instead we use a header file for the VFile to store 
dictionary entries. The header file for the VFile is not needed for some 
columns, and sometimes it is a must. 
I think refactoring the file formats should be the starting point for this 
issue.

Thanks again.


 Make Hive support column based storage
 --

 Key: HIVE-352
 URL: https://issues.apache.org/jira/browse/HIVE-352
 Project: Hadoop Hive
  Issue Type: New Feature
Reporter: he yongqiang

 Column-based storage has been proven to be a better storage layout for 
 OLAP. Hive does a great job on raw row-oriented storage. In this issue, we 
 will enhance Hive to support column-based storage. 
 Actually, we have done some work on column-based storage on top of HDFS; I 
 think it will need some review and refactoring to port it to Hive.
 Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-347) [hive] lot of mappers due to a user error while specifying the partitioning column

2009-03-17 Thread Namit Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Namit Jain updated HIVE-347:


   Resolution: Fixed
Fix Version/s: 0.3.0
   Status: Resolved  (was: Patch Available)

committed.

 [hive] lot of mappers due to a user error while specifying the partitioning 
 column
 --

 Key: HIVE-347
 URL: https://issues.apache.org/jira/browse/HIVE-347
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
 Fix For: 0.3.0

 Attachments: hive.347.1.patch, hive.347.2.patch, hive.347.3.patch


 A common scenario is a table partitioned on a 'ds' column which is of 
 type 'string' with a certain format 'yyyy-mm-dd'.
 However, if the user forgets to add quotes while specifying the query:
 select ... from T where ds = 2009-02-02
 2009-02-02 is a valid integer expression, so partition pruning marks all 
 partitions unknown, since the conversion of 2009-02-02 to double is null.
 If all partitions are unknown, in strict mode, we should throw an error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-350) [Hive] wrong order in explain plan

2009-03-17 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682816#action_12682816
 ] 

Namit Jain commented on HIVE-350:
-

The explain plan is not wrong per se: the order is random, and eventually the 
select corrects the order. But it was very difficult to understand.

 [Hive] wrong order in explain plan
 --

 Key: HIVE-350
 URL: https://issues.apache.org/jira/browse/HIVE-350
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
 Attachments: hive.350.1.patch


 In the case of multiple aggregations, the explain plan might show the wrong 
 order of aggregations, since QBParseInfo maintains the information in a 
 hashmap, which does not guarantee that results are returned in order.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HIVE-333) Add TFileTransport deserializer

2009-03-17 Thread Joydeep Sen Sarma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joydeep Sen Sarma reassigned HIVE-333:
--

Assignee: Joydeep Sen Sarma

 Add TFileTransport deserializer
 ---

 Key: HIVE-333
 URL: https://issues.apache.org/jira/browse/HIVE-333
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Serializers/Deserializers
 Environment: Linux
Reporter: Steve Corona
Assignee: Joydeep Sen Sarma

 I've been googling around all night and haven't really found what I am 
 looking for. Basically, I want to transfer some data from my web servers to 
 Hive in a format that's a little more verbose than plain CSV files. It seems 
 like JSON or Thrift would be perfect for this. I am planning on sending this 
 serialized JSON or Thrift data through Scribe and loading it into Hive. I 
 just can't figure out how to tell Hive that the input data is a bunch of 
 serialized Thrift records (all of the records are the same struct type) in a 
 TFileTransport. Hopefully this makes sense...
 Reply from Joydeep Sen Sarma (jssa...@facebook.com):
 Unfortunately the open source code base does not have the loaders we run to 
 convert Thrift records in a TFileTransport into a SequenceFile that 
 Hadoop/Hive can work with. One option is that we add this to the Hive code 
 base (should be straightforward).
 No process required. Please file a jira; I will try to upload a patch this 
 weekend (just cut-and-paste for the most part). Would appreciate some help 
 in finessing it (the internal code is hardwired to some assumptions, etc.).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



JIRA_hive.350.1.patch_UNIT_TEST_SUCCEEDED

2009-03-17 Thread Murli Varadachari

SUCCESS: BUILD AND UNIT TEST using PATCH hive.350.1.patch PASSED!!



[jira] Created: (HIVE-353) Comments can't have semi-colons

2009-03-17 Thread S. Alex Smith (JIRA)
Comments can't have semi-colons
---

 Key: HIVE-353
 URL: https://issues.apache.org/jira/browse/HIVE-353
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: S. Alex Smith
Priority: Minor



hive> CREATE TABLE tmp_foo(foo DOUBLE COMMENT ';');
FAILED: Parse Error: line 2:7 mismatched input 'TABLE' expecting TEMPORARY in 
create function statement

hive> CREATE TABLE tmp_foo(foo DOUBLE);
OK


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



JIRA_patch-251_1.txt_UNIT_TEST_FAILED

2009-03-17 Thread Murli Varadachari

ERROR: UNIT TEST using PATCH patch-251_1.txt FAILED!!

[junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED
BUILD FAILED



[jira] Commented: (HIVE-251) Failures in Transform don't stop the job

2009-03-17 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12682866#action_12682866
 ] 

Namit Jain commented on HIVE-251:
-

+1

looks good. 

Before committing, can you remove the extra commented line in the reduce 
script, and add a comment explaining what it is doing?

 Failures in Transform don't stop the job
 

 Key: HIVE-251
 URL: https://issues.apache.org/jira/browse/HIVE-251
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.2.0
Reporter: S. Alex Smith
Assignee: Ashish Thusoo
Priority: Blocker
 Fix For: 0.3.0

 Attachments: patch-251.txt, patch-251_1.txt


 If the program executed via a SELECT TRANSFORM() USING 'foo' exits with a 
 non-zero exit status, Hive proceeds as if nothing bad happened.  The main way 
 that the user knows something bad has happened is if the user checks the logs 
 (probably because he got no output).  This is doubly bad if the program only 
 fails part of the time (say, on certain inputs) since the job will still 
 produce output and thus the problem will likely go undetected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HIVE-354) [hive] udf needed for getting length of a string

2009-03-17 Thread Namit Jain (JIRA)
[hive] udf needed for getting length of a string


 Key: HIVE-354
 URL: https://issues.apache.org/jira/browse/HIVE-354
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Namit Jain




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-251) Failures in Transform don't stop the job

2009-03-17 Thread Ashish Thusoo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Thusoo updated HIVE-251:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

committed. Also made the changes suggested by Namit.

 Failures in Transform don't stop the job
 

 Key: HIVE-251
 URL: https://issues.apache.org/jira/browse/HIVE-251
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Affects Versions: 0.2.0
Reporter: S. Alex Smith
Assignee: Ashish Thusoo
Priority: Blocker
 Fix For: 0.3.0

 Attachments: patch-251.txt, patch-251_1.txt


 If the program executed via a SELECT TRANSFORM() USING 'foo' exits with a 
 non-zero exit status, Hive proceeds as if nothing bad happened.  The main way 
 that the user knows something bad has happened is if the user checks the logs 
 (probably because he got no output).  This is doubly bad if the program only 
 fails part of the time (say, on certain inputs) since the job will still 
 produce output and thus the problem will likely go undetected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.