[jira] Created: (HIVE-352) Make Hive support column based storage
Make Hive support column based storage
---------------------------------------
Key: HIVE-352
URL: https://issues.apache.org/jira/browse/HIVE-352
Project: Hadoop Hive
Issue Type: New Feature
Reporter: he yongqiang

Column-based storage has been proven a better storage layout for OLAP. Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage. Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive. Any thoughts?
Hudson build is back to normal: Hive-trunk-h0.17 #34
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/34/changes
Can I specify a query in a test to see execution trace?
Hello,

Is there a simple test in which I can specify a query and see its execution trace under Eclipse's debug mode? Is there any test that interactively prompts for a query?

Thanks,
shyam_sar...@yahoo.com
[jira] Commented: (HIVE-352) Make Hive support column based storage
[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716 ]

Joydeep Sen Sarma commented on HIVE-352:
-----------------------------------------

Thanks for taking this on - this could be pretty awesome. Traditionally, the arguments for columnar storage have been limited scan bandwidth and compression. In practice, we see that scan bandwidth has two components:

1. disk/file-system bandwidth to read data
2. compute cost to scan data

Most columnar stores optimize for both (especially because in shared-disk architectures #1 is at a premium). However, our limited experience suggests that in Hadoop #1 is almost infinite, while #2 can still be a bottleneck. (It is possible that this observation holds because of high Hadoop/Java compute overheads - regardless, this seems to be the reality.)

Given this, I like the idea of a scheme where columns are stored as independent streams inside a block-oriented file format (each file block contains a set of rows, but the organization inside a block is by column). This does not optimize for #1, but it does optimize for #2 (potentially in conjunction with Hive's interfaces for getting one column at a time from the IO libraries). It also gives us nearly equivalent compression. (The alternative scheme of having different file(s) per column is also complicated by the fact that locality is almost impossible to ensure, and there is no reasonable way of asking HDFS to colocate different file segments in the near future.)

I would love to understand how you are planning to approach this. Will we still use SequenceFiles as a container, or should we ditch them? (SequenceFile wasn't a great fit for Hive, given that we don't use the key field, but it was the best thing we could find.) We have seen that having a number of open codecs can hurt memory usage - that's one open question for me: can we actually afford to open N concurrent compressed streams (assuming each column is stored compressed separately)?

It also seems that one could define a ColumnarInputFormat/OutputFormat as a generic API with different implementations and different pluggable containers underneath, supporting either a file-per-column or a columnar-within-a-block scheme. In that sense we could build something more generic for Hadoop (and then just make sure that Hive's lazy SerDe uses the columnar API for data access, instead of the row-based API exposed by the current InputFormat).
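To make the "columnar inside a block" idea above concrete, here is a minimal sketch of a writer that buffers one row group and lays each column out as a contiguous stream inside the block. This is purely illustrative - all class and method names are hypothetical, and it is not Hive or Hadoop code:

{code}
// Illustrative only: buffer one row group; on flush, write a small
// directory of per-column byte lengths, then each column's bytes back
// to back, so a reader can seek straight to the one column it needs.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

public class ColumnarBlockWriter {
  private final ByteArrayOutputStream[] columns;

  public ColumnarBlockWriter(int numColumns) {
    columns = new ByteArrayOutputStream[numColumns];
    for (int i = 0; i < numColumns; i++) {
      columns[i] = new ByteArrayOutputStream();
    }
  }

  // Buffer one row by appending each field to its column's stream,
  // length-prefixed so values can be decoded later.
  public void appendRow(List<byte[]> fields) throws IOException {
    for (int i = 0; i < fields.size(); i++) {
      DataOutputStream col = new DataOutputStream(columns[i]);
      col.writeInt(fields.get(i).length);
      col.write(fields.get(i));
    }
  }

  // Flush the row group: column count, per-column lengths, then the
  // column streams themselves.
  public void flushBlock(DataOutputStream out) throws IOException {
    out.writeInt(columns.length);
    for (ByteArrayOutputStream col : columns) {
      out.writeInt(col.size());
    }
    for (ByteArrayOutputStream col : columns) {
      col.writeTo(out);
      col.reset();
    }
  }
}
{code}

Because each column's bytes are contiguous, each could also be run through its own codec before flushing - which is exactly where the question above about N concurrently open compressed streams comes in.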
[jira] Commented: (HIVE-78) Authentication infrastructure for Hive
[ https://issues.apache.org/jira/browse/HIVE-78?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682719#action_12682719 ]

Edward Capriolo commented on HIVE-78:
--------------------------------------

We also have to look at this on the file-system level. For example, files in my warehouse are owned by the user who created the table:

{quote}
/user/hive/warehouse/edward dir 2008-10-30 17:13 rwxr-xr-x edward supergroup
{quote}

Regardless of what permissions are granted in the metastore (via this jira), the Hadoop ACL governs what a user can do to that file. This is not an issue in MySQL: in a typical MySQL deployment, all of the data files are owned by a single mysql user. I do not see a clear-cut solution for this.

In one scenario, we make sure all the files in the warehouse are readable and writable by all, or owned by a specific user, and a component like HiveServer, the CLI, or HWI decides whether the user's action succeeds based on the metadata. The other option is that an operation like 'GRANT SELECT' would have to physically modify the Hadoop ACL/owner - but that method will not give us the fine-grained control we desire.

Authentication infrastructure for Hive
---------------------------------------
Key: HIVE-78
URL: https://issues.apache.org/jira/browse/HIVE-78
Project: Hadoop Hive
Issue Type: New Feature
Components: Server Infrastructure
Reporter: Ashish Thusoo
Assignee: Edward Capriolo

Allow Hive to integrate with existing user repositories for authentication and authorization information.
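As a sketch of the first option (a front end consulting HDFS file metadata before permitting an action), assuming the standard Hadoop FileSystem API - the path, class name, and decision logic here are illustrative, and the group-membership check is omitted:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class WarehouseAclCheck {
  // Would the HDFS permission bits let this user read the table's files?
  public static boolean canRead(String user, Path table, Configuration conf)
      throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    FileStatus st = fs.getFileStatus(table);
    if (st.getOwner().equals(user)) {
      // Owner bit: e.g. rwxr-xr-x grants the owner read access.
      return st.getPermission().getUserAction().implies(FsAction.READ);
    }
    // Fall through to the "other" bits (group check omitted in this sketch).
    return st.getPermission().getOtherAction().implies(FsAction.READ);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path("/user/hive/warehouse/edward");
    System.out.println(canRead("edward", p, conf));
  }
}
{code}

The catch, as the comment notes, is that this only mirrors the coarse HDFS bits; any finer-grained grants in the metastore would still have to be enforced separately by the front end.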
[jira] Commented: (HIVE-352) Make Hive support column based storage
[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740 ]

he yongqiang commented on HIVE-352:
------------------------------------

Thanks, Joydeep Sen Sarma. Your feedback is really important.

1. Storage scheme: block-wise column store vs. one file per column.
Our current implementation stores each column in its own file. The most annoying part for us, just as you said, is that currently - and even in the near future - HDFS does not support colocating the different file segments of columns in the same table. So some operations need to fetch data from a new file (like a map-side hash join, or a join with CompositeInputFormat), or need an extra map-reduce job to merge data together; other operations work fine as-is. I think block-wise column storage is a good idea, and I will try to implement it soon. With different columns colocated in a single block, some operations would not need a reduce phase (which is really time-consuming).

2. Compression.
With different columns in different files, lightweight compression schemes such as RLE, dictionary, and bit-vector encoding can be used. One benefit of these lightweight compression algorithms is that some operations do not need to decompress the data at all. If we implement block-wise column storage, should we let the user specify the lightweight compression algorithm for each column, or should we choose one (like RLE) internally when the data clusters well? Since dictionary and bit-vector encodings should also be supported, should columns using those algorithms also be placed in the block-wise columnar file? I think placing such columns in separate files would be easier to handle, but I do not know whether that fits into Hive - I am new to Hive.

{quote} having a number of open codecs can hurt in memory usage {quote}

Currently I cannot think of a solution to avoid this for the file-per-column store.

3. File format.
Yes, I think we need to add new file formats and their corresponding InputFormats. Currently we have implemented VFile (Value File - we do not need to store a key part) and BitMapFile. We have not implemented a DictionaryFile; instead, we use a header file for VFile to store dictionary entries. The header file is unnecessary for some columns and mandatory for others. I think refactoring the file formats should be the starting point for this issue.

Thanks again.
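As an illustration of the lightweight-compression point above - that some operations need no decompression - here is a self-contained sketch (not the actual VFile code) of run-length encoding a clustered column and answering a count directly on the runs:

{code}
import java.util.ArrayList;
import java.util.List;

public class RleColumn {
  // Each run is (value, repeat count).
  static class Run {
    final int value;
    final int count;
    Run(int value, int count) { this.value = value; this.count = count; }
  }

  static List<Run> encode(int[] column) {
    List<Run> runs = new ArrayList<Run>();
    int i = 0;
    while (i < column.length) {
      int j = i;
      while (j < column.length && column[j] == column[i]) j++;
      runs.add(new Run(column[i], j - i));
      i = j;
    }
    return runs;
  }

  // COUNT(*) WHERE col = v, evaluated on the compressed runs themselves.
  static long countEquals(List<Run> runs, int v) {
    long n = 0;
    for (Run r : runs) {
      if (r.value == v) n += r.count;
    }
    return n;
  }

  public static void main(String[] args) {
    int[] col = {5, 5, 5, 7, 7, 5, 5};
    List<Run> runs = encode(col);              // [(5,3), (7,2), (5,2)]
    System.out.println(countEquals(runs, 5));  // prints 5
  }
}
{code}

The same shape applies to dictionary and bit-vector encodings: predicates can often be evaluated against the encoded representation, which targets the scan-compute cost discussed earlier in the thread.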
[jira] Updated: (HIVE-347) [hive] lot of mappers due to a user error while specifying the partitioning column
[ https://issues.apache.org/jira/browse/HIVE-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-347:
-----------------------------
Resolution: Fixed
Fix Version/s: 0.3.0
Status: Resolved (was: Patch Available)

Committed.

[hive] lot of mappers due to a user error while specifying the partitioning column
------------------------------------------------------------------------------------
Key: HIVE-347
URL: https://issues.apache.org/jira/browse/HIVE-347
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Fix For: 0.3.0
Attachments: hive.347.1.patch, hive.347.2.patch, hive.347.3.patch

A common scenario: the table is partitioned on a 'ds' column of type 'string' in the format 'yyyy-mm-dd'. However, if the user forgets to add quotes while specifying the query:

select ... from T where ds = 2009-02-02

2009-02-02 is a valid integer expression (it parses as the subtraction 2009 - 02 - 02, which evaluates to 2005). So partition pruning marks all partitions unknown, since converting a partition value such as '2009-02-02' to double yields null. If all partitions are unknown, in strict mode we should throw an error.
[jira] Commented: (HIVE-350) [Hive] wrong order in explain plan
[ https://issues.apache.org/jira/browse/HIVE-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682816#action_12682816 ]

Namit Jain commented on HIVE-350:
----------------------------------

The explain plan is not wrong per se - the order is random, and eventually the select corrects the order. But it was very difficult to understand.

[Hive] wrong order in explain plan
-----------------------------------
Key: HIVE-350
URL: https://issues.apache.org/jira/browse/HIVE-350
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Attachments: hive.350.1.patch

In the case of multiple aggregations, the order of aggregations in the explain plan might be wrong, since QBParseInfo maintains the information in a HashMap, which does not guarantee that entries are returned in insertion order.
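The root cause is plain JDK behavior, shown by this self-contained snippet - HashMap iteration order is unspecified, while LinkedHashMap preserves insertion order (the keys below are made-up aggregation expressions, not the actual Hive fix):

{code}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class MapOrder {
  public static void main(String[] args) {
    Map<String, String> hash = new HashMap<String, String>();
    Map<String, String> linked = new LinkedHashMap<String, String>();
    String[] aggs = {"sum(a)", "count(b)", "avg(c)", "min(d)"};
    for (String a : aggs) {
      hash.put(a, a);
      linked.put(a, a);
    }
    // Order depends on hash codes and capacity - effectively arbitrary.
    System.out.println(hash.keySet());
    // Always [sum(a), count(b), avg(c), min(d)].
    System.out.println(linked.keySet());
  }
}
{code}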
[jira] Assigned: (HIVE-333) Add TFileTransport deserializer
[ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reassigned HIVE-333:
---------------------------------------
Assignee: Joydeep Sen Sarma

Add TFileTransport deserializer
--------------------------------
Key: HIVE-333
URL: https://issues.apache.org/jira/browse/HIVE-333
Project: Hadoop Hive
Issue Type: New Feature
Components: Serializers/Deserializers
Environment: Linux
Reporter: Steve Corona
Assignee: Joydeep Sen Sarma

I've been googling around all night and haven't really found what I am looking for. Basically, I want to transfer some data from my web servers to Hive in a format that's a little more verbose than plain CSV files. It seems like JSON or Thrift would be perfect for this. I am planning on sending this serialized JSON or Thrift data through Scribe and loading it into Hive. I just can't figure out how to tell Hive that the input data is a bunch of serialized Thrift records (all of the records are of the same struct type) in a TFileTransport. Hopefully this makes sense.

Reply from Joydeep Sen Sarma (jssa...@facebook.com):

Unfortunately, the open-source code base does not have the loaders we run to convert Thrift records in a TFileTransport into a SequenceFile that Hadoop/Hive can work with. One option is that we add this to the Hive code base (should be straightforward). No process required - please file a jira. I will try to upload a patch this weekend (just cut-and-paste for the most part). Would appreciate some help in finessing it out (the internal code is hardwired to some assumptions, etc.).
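A rough sketch of the conversion direction being discussed: write already-serialized Thrift records into a SequenceFile with a null key (since Hive ignores the key field). The TFileTransport reading side is deliberately elided here, and the class and method names are made up - this is not the loader mentioned above:

{code}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class ThriftToSequenceFile {
  // records: raw bytes of each serialized Thrift struct, however obtained.
  public static void toSequenceFile(Iterator<byte[]> records, Path out,
                                    Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, NullWritable.class, BytesWritable.class);
    try {
      while (records.hasNext()) {
        // Hive does not use the key, so NullWritable stands in for it.
        writer.append(NullWritable.get(), new BytesWritable(records.next()));
      }
    } finally {
      writer.close();
    }
  }
}
{code}

A Thrift-aware SerDe on the Hive side would then deserialize each BytesWritable value back into the struct type at read time.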
JIRA_hive.350.1.patch_UNIT_TEST_SUCCEEDED
SUCCESS: BUILD AND UNIT TEST using PATCH hive.350.1.patch PASSED!!
[jira] Created: (HIVE-353) Comments can't have semi-colons
Comments can't have semi-colons
--------------------------------
Key: HIVE-353
URL: https://issues.apache.org/jira/browse/HIVE-353
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: S. Alex Smith
Priority: Minor

hive> CREATE TABLE tmp_foo(foo DOUBLE COMMENT ';');
FAILED: Parse Error: line 2:7 mismatched input 'TABLE' expecting TEMPORARY in create function statement

hive> CREATE TABLE tmp_foo(foo DOUBLE);
OK
JIRA_patch-251_1.txt_UNIT_TEST_FAILED
ERROR: UNIT TEST using PATCH patch-251_1.txt FAILED!!

[junit] Test org.apache.hadoop.hive.cli.TestCliDriver FAILED

BUILD FAILED
[jira] Commented: (HIVE-251) Failures in Transform don't stop the job
[ https://issues.apache.org/jira/browse/HIVE-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682866#action_12682866 ]

Namit Jain commented on HIVE-251:
----------------------------------

+1, looks good. Before committing, can you remove the extra commented-out line in the reduce script and add a comment explaining what it is doing?

Failures in Transform don't stop the job
-----------------------------------------
Key: HIVE-251
URL: https://issues.apache.org/jira/browse/HIVE-251
Project: Hadoop Hive
Issue Type: Bug
Components: Serializers/Deserializers
Affects Versions: 0.2.0
Reporter: S. Alex Smith
Assignee: Ashish Thusoo
Priority: Blocker
Fix For: 0.3.0
Attachments: patch-251.txt, patch-251_1.txt

If the program executed via a SELECT TRANSFORM() USING 'foo' exits with a non-zero exit status, Hive proceeds as if nothing bad happened. The main way the user learns that something went wrong is by checking the logs (probably because he got no output). This is doubly bad if the program fails only part of the time (say, on certain inputs), since the job will still produce output and the problem will likely go undetected.
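The shape of the fix can be illustrated with a standalone snippet that waits for a child process and surfaces a non-zero exit status as a failure instead of silently continuing - a sketch of the idea only, not the contents of patch-251:

{code}
import java.io.IOException;

public class ScriptExitCheck {
  public static void run(String... cmd) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    int rc = p.waitFor();
    if (rc != 0) {
      // Propagate the failure so the enclosing job is marked failed,
      // rather than producing partial output with no error.
      throw new IOException("transform script exited with status " + rc);
    }
  }

  public static void main(String[] args) throws Exception {
    run("/bin/false");  // throws: exit status 1
  }
}
{code}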
[jira] Created: (HIVE-354) [hive] udf needed for getting length of a string
[hive] udf needed for getting length of a string
-------------------------------------------------
Key: HIVE-354
URL: https://issues.apache.org/jira/browse/HIVE-354
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
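For illustration, a string-length UDF in Hive's classic UDF style (extend UDF and provide an evaluate method) might look like the following - the class name is hypothetical, and this is not the committed implementation:

{code}
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class UDFStringLength extends UDF {
  // Reused result object, following the usual Writable-reuse pattern.
  private final IntWritable result = new IntWritable();

  public IntWritable evaluate(Text s) {
    if (s == null) {
      return null;  // SQL semantics: length(NULL) is NULL
    }
    result.set(s.toString().length());
    return result;
  }
}
{code}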
[jira] Updated: (HIVE-251) Failures in Transform don't stop the job
[ https://issues.apache.org/jira/browse/HIVE-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-251:
--------------------------------
Resolution: Fixed
Hadoop Flags: [Reviewed]
Status: Resolved (was: Patch Available)

Committed. Also made the changes suggested by Namit.