[jira] Updated: (HIVE-417) Implement Indexing in Hive

2010-07-29 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-417:


  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
Release Note: Index support requires metastore schema upgrade (TBD).
  Resolution: Fixed

Committed.  Thanks Yongqiang!


> Implement Indexing in Hive
> --
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
>  Issue Type: New Feature
>  Components: Metastore, Query Processor
>Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>Reporter: Prasad Chakka
>Assignee: He Yongqiang
> Fix For: 0.7.0
>
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, 
> hive-indexing-8-thrift-metastore-remodel.patch, hive-indexing.3.patch, 
> hive-indexing.5.thrift.patch, hive.indexing.11.patch, hive.indexing.12.patch, 
> hive.indexing.13.patch, idx2.png, indexing_with_ql_rewrites_trunk_953221.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1492:
---

Fix Version/s: 0.6.0

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.6.0, 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 
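The retain-the-largest rule described above can be sketched in Python. This is a hypothetical illustration only, not Hive's actual Utilities.removeTempOrDuplicateFiles implementation; the <taskid>_<attemptid> file-name scheme is an assumption for demonstration:

```python
import os
import re
import tempfile

# Hypothetical sketch of the proposed rule: when several output files exist
# for the same task (failed attempts, speculative runs), keep only the
# largest one.  The <taskid>_<attemptid> naming below is assumed for
# illustration and does not exactly mirror Hive's attempt-id format.
TASK_FILE = re.compile(r"^(\d{6})_(\d+)$")

def remove_duplicate_files(dir_path):
    """Delete all but the largest file produced for each task id."""
    best = {}  # task id -> (size, path) of the largest file seen so far
    for name in os.listdir(dir_path):
        match = TASK_FILE.match(name)
        if not match:
            continue  # ignore files that do not look like task outputs
        path = os.path.join(dir_path, name)
        size = os.path.getsize(path)
        task_id = match.group(1)
        if task_id not in best or size > best[task_id][0]:
            best[task_id] = (size, path)
    keep = {path for _, path in best.values()}
    for name in os.listdir(dir_path):
        path = os.path.join(dir_path, name)
        if TASK_FILE.match(name) and path not in keep:
            os.remove(path)
```

Keeping the largest file rather than the first assumes a successful attempt writes at least as much data as a failed or killed one, which appears to be the rationale behind this issue.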




[jira] Commented: (HIVE-1491) fix or disable loadpart_err.q

2010-07-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893909#action_12893909
 ] 

He Yongqiang commented on HIVE-1491:


+1. Running test now.

> fix or disable loadpart_err.q
> -
>
> Key: HIVE-1491
> URL: https://issues.apache.org/jira/browse/HIVE-1491
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: John Sichi
> Fix For: 0.7.0
>
> Attachments: HIVE-1491.patch
>
>
> This test fails sporadically due to a race condition, which is annoying since 
> it hinders pre-commit testing of patches.  I'm going to disable it unless 
> someone has a fix.




[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893907#action_12893907
 ] 

He Yongqiang commented on HIVE-1492:


Committed to branch-0.6 as well. Thanks John!





[jira] Updated: (HIVE-417) Implement Indexing in Hive

2010-07-29 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-417:


Status: Patch Available  (was: Open)





[jira] Updated: (HIVE-1491) fix or disable loadpart_err.q

2010-07-29 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1491:
-

Status: Patch Available  (was: Open)





[jira] Updated: (HIVE-1491) fix or disable loadpart_err.q

2010-07-29 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1491:
-

Attachment: HIVE-1491.patch





[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-29 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893890#action_12893890
 ] 

John Sichi commented on HIVE-417:
-

OK, testing lucky patch 13...





[jira] Updated: (HIVE-417) Implement Indexing in Hive

2010-07-29 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-417:
--

Attachment: hive.indexing.13.patch

a new patch against trunk





[jira] Updated: (HIVE-1126) Missing some Jdbc functionality like getTables getColumns and HiveResultSet.get* methods based on column name.

2010-07-29 Thread John Sichi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sichi updated HIVE-1126:
-

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Committed.  Thanks Bennie!


> Missing some Jdbc functionality like getTables getColumns and 
> HiveResultSet.get* methods based on column name.
> --
>
> Key: HIVE-1126
> URL: https://issues.apache.org/jira/browse/HIVE-1126
> Project: Hadoop Hive
>  Issue Type: Improvement
>  Components: Clients
>Reporter: Bennie Schut
>Assignee: Bennie Schut
> Fix For: 0.7.0
>
> Attachments: HIVE-1126-1.patch, HIVE-1126-2.patch, HIVE-1126-3.patch, 
> HIVE-1126-4.patch, HIVE-1126-5.patch, HIVE-1126-6.patch, HIVE-1126-7.patch, 
> HIVE-1126.patch, HIVE-1126_patch(0.5.0_source).patch
>
>
> I've been using the hive jdbc driver more and more and was missing some 
> functionality, which I added:
> HiveDatabaseMetaData.getTables
> Using "show tables" to get the info from hive.
> HiveDatabaseMetaData.getColumns
> Using "describe tablename" to get the columns.
> This makes using something like SQuirreL a lot nicer, since you have the list 
> of tables and can just click on the content tab to see what's in the table.
> I also implemented
> HiveResultSet.getObject(String columnName), so you can call most get* methods 
> based on the column name.




RE: [howldev] Initial thoughts on authorization in howl

2010-07-29 Thread Pradeep Kamath
Hi Ashish,
   The changes I was mentioning in the mail were changes in hive code.
For howl, we will have howl-specific semantic analyzers which will
enforce authorization for DDL (like create table/drop table) against
hdfs permissions. This is our initial thought on authorization for DDL
through the howl CLI - note, this does NOT change anything for the hive
CLI. I did notice there is a jira in hive for authorization which seems
similar to SQL authorization. The big issue there is reconciling SQL
permissions with hdfs permissions
(https://issues.apache.org/jira/browse/HIVE-78?focusedCommentId=12682719&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12682719).

So for now, we are going the hdfs route initially in howl - this may not
be the final story for authorization in howl, but it is a beginning.

I was initially thinking of using the hive conf variables rather than
table properties, but I guess either could be used. Initially I thought
we wouldn't need to persist the group and permissions in the metastore,
and would just use that info while creating the table dir; later, when
creating partition dirs, we would just consult the table dir. If we use
table properties, we can persist the group/perms info in the metastore,
and partition creation can get it from the metastore.


Pradeep

-Original Message-
From: Ashish Thusoo [mailto:athu...@facebook.com] 
Sent: Thursday, July 29, 2010 3:01 PM
To: hive-dev@hadoop.apache.org
Cc: howl...@yahoogroups.com
Subject: RE: [howldev] Initial thoughts on authorization in howl

Hi Pradeep,

I get from this note that the authorization you are talking about here
is basically the management of the permissions on the hdfs directories
corresponding to the tables and the partitions. From that angle this
sounds good to me. There is a whole set of permissions/authorizations
with regard to the metadata operations themselves, e.g. who should be
able to run an alter table add column or a describe table. I presume
that would be beyond the scope of this change and would come in later?
I am thinking more in terms of the permissions model that is supported
in SQL using GRANT statements etc.

I also presume that by conf variables you mean the key/value properties
that Hive can store in the metadata and not the hive conf variables,
right?

Ashish

-Original Message-
From: John Sichi [mailto:jsi...@facebook.com] 
Sent: Wednesday, July 28, 2010 2:22 PM
To: hive-dev@hadoop.apache.org
Subject: Fwd: [howldev] Initial thoughts on authorization in howl

Begin forwarded message:

From: Pradeep Kamath <prade...@yahoo-inc.com>
Date: July 27, 2010 4:38:42 PM PDT
To: <howl...@yahoogroups.com>
Subject: [howldev] Initial thoughts on authorization in howl
Reply-To: <howl...@yahoogroups.com>



The initial thoughts on authorization in howl are to model authorization
(for DDL ops like create table/drop table/add partition etc) after hdfs
permissions. To be able to do this, we would like to extend
createTable() to add the ability to record a different group from the
user's primary group and to record the complete unix permissions on the
table directory. Also, we would like to have a way for partition
directories to inherit permissions and group information based on the
table directory. To keep the metastore backward compatible for use with
hive, I propose having conf variables to achieve these objectives:
-  table.group.name - value will
indicate the name of the unix group for the table directory. This will
be used by createTable() to perform a chgrp to the value provided. This
property will provide the user the ability to choose from one of the
many unix groups he is part of to associate with the table.
-  table.permissions - value will be of the form rwxrwxrwx to
indicate read-write-execute permissions on the table directory. This
will be used by createTable() to perform a chmod to the value provided.
This will let the user decide what permissions he wants on the table.
-  partitions.inherit.permissions - a value of true will
indicate that partitions inherit the group name and permissions of the
table level directory. This will be used by addPartition() to perform a
chgrp and chmod to the values as on the table directory.

I favor conf properties over API changes since the complete
authorization design for hive is not finalized yet. These properties can
be deprecated/removed when that is in place. These properties would also
be useful to some installations of vanilla hive, since at least DFS-level
authorization can then be achieved by hive without the user having to
manually perform chgrp and chmod operations on DFS.
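As a rough illustration of the proposal above (on the local filesystem instead of HDFS, with hypothetical helper names; the chgrp part is omitted since changing group ownership needs appropriate privileges), the table.permissions and partitions.inherit.permissions behavior might look like:

```python
import os
import stat
import tempfile

# Illustrative sketch only -- not Howl/Hive code.  A table directory is
# created with an explicit rwxrwxrwx-style permission string (analogous
# to the proposed table.permissions property), and partition directories
# inherit the table directory's mode (analogous to
# partitions.inherit.permissions).  Group handling (chgrp) is omitted.

PERM_BITS = [0o400, 0o200, 0o100, 0o040, 0o020, 0o010, 0o004, 0o002, 0o001]

def parse_perms(perms):
    """Convert a string like 'rwxr-x---' into a chmod mode."""
    if len(perms) != 9:
        raise ValueError("expected 9 characters, e.g. 'rwxr-x---'")
    mode = 0
    for ch, bit in zip(perms, PERM_BITS):
        if ch != "-":
            mode |= bit
    return mode

def create_table_dir(warehouse, table, perms):
    """Create the table directory and chmod it to the requested permissions."""
    path = os.path.join(warehouse, table)
    os.mkdir(path)
    os.chmod(path, parse_perms(perms))
    return path

def add_partition_dir(table_dir, partition, inherit=True):
    """Create a partition directory, inheriting the table directory's mode."""
    path = os.path.join(table_dir, partition)
    os.mkdir(path)
    if inherit:
        os.chmod(path, stat.S_IMODE(os.stat(table_dir).st_mode))
    return path
```

On HDFS the analogous calls would be FileSystem.setPermission and FileSystem.setOwner, applied by createTable() and addPartition() as described above.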

I would like to hear from hive developers/committers whether this would
be acceptable for hive and also thoughts from others.

Pradeep




Re: load_dyn_part2.q on Hadoop 17

2010-07-29 Thread John Sichi
I only ran clean in between, not clean-test.  But the test target depends on 
clean-test, so I don't think doing that explicitly would have made a difference.

JVS

On Jul 29, 2010, at 4:51 PM, Ning Zhang wrote:

> Did you run ant clean clean-test after 20 and before 17? If not, some tests 
> later than load_dyn_part2 could have changed srcpart.
>
> On Jul 29, 2010, at 4:40 PM, John Sichi wrote:
>
>> Yes, it passed for me just now when run in isolation.
>>
>> I wonder why it passed on the first full test run (on Hadoop 20), but then 
>> failed when I re-ran against 17.
>>
>> JVS
>>
>> On Jul 29, 2010, at 4:11 PM, Ning Zhang wrote:
>>
>>> John, it works for me when running this test alone. One thing I noticed is 
>>> that the results you got have additional partitions (hr=13..19). This can be 
>>> explained by srcpart having been changed (additional partitions added by 
>>> other tests). Since we no longer clean up the srcpart tables for each .q 
>>> file, the side effects of previous .q files can remain. Can you check 
>>> whether you get the correct results by testing this .q file alone?
>>>
>>>
>>> On Jul 29, 2010, at 1:08 PM, John Sichi wrote:
>>>
 I just hit a test failure with this on latest trunk (while testing out a 
 patch); see diff output below.  Do you know if this broke recently?  Same 
 code passed on Hadoop 20.

 JVS

 

 [jsi...@dev578 ~/open/commit-trunk] diff 
 ql/src/test/results/clientpositive/load_dyn_part2.q.out 
 build/ql/test/logs/clientpositive/load_dyn_part2.q.out
 --- ql/src/test/results/clientpositive/load_dyn_part2.q.out 2010-07-28 
 23:16:54.0 -0700
 +++ build/ql/test/logs/clientpositive/load_dyn_part2.q.out  2010-07-29 
 09:57:13.0 -0700
 @@ -16,7 +16,7 @@
 ds string
 hr string

 -Detailed Table Information Table(tableName:nzhang_part_bucket, 
 dbName:default, owner:jssarma, createTime:1279737530, lastAccessTime:0, 
 retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, 
 comment:null), FieldSchema(name:value, type:string, comment:null)], 
 location:file:/mnt/vol/devrs004.snc1/jssarma/projects/hive_trunk/build/ql/test/data/warehouse/nzhang_part_bucket,
  inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
 outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
 compressed:false, numBuckets:10, serdeInfo:SerDeInfo(name:null, 
 serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
 parameters:{serialization.format=1}), bucketCols:[key], sortCols:[], 
 parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, 
 comment:null), FieldSchema(name:hr, type:string, comment:null)], 
 parameters:{transient_lastDdlTime=1279737530}, viewOriginalText:null, 
 viewExpandedText:null, tableType:MANAGED_TABLE)
 +Detailed Table Information Table(tableName:nzhang_part_bucket, 
 dbName:default, owner:jsichi, createTime:1280422615, lastAccessTime:0, 
 retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, 
 comment:null), FieldSchema(name:value, type:string, comment:null)], 
 location:pfile:/data/users/jsichi/open/commit-trunk/build/ql/test/data/warehouse/nzhang_part_bucket,
  inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
 outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
 compressed:false, numBuckets:10, serdeInfo:SerDeInfo(name:null, 
 serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
 parameters:{serialization.format=1}), bucketCols:[key], sortCols:[], 
 parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, 
 comment:null), FieldSchema(name:hr, type:string, comment:null)], 
 parameters:{transient_lastDdlTime=1280422615}, viewOriginalText:null, 
 viewExpandedText:null, tableType:MANAGED_TABLE)
 PREHOOK: query: explain
 insert overwrite table nzhang_part_bucket partition (ds='2010-03-23', hr) 
 select key, value, hr from srcpart where ds is not null and hr is not null
 PREHOOK: type: QUERY
 @@ -104,34 +104,98 @@
 POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
 POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
 POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=10
 POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=11
 POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=12
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=13
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=14
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=15
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=16
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=17
 +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=18
 +

Re: load_dyn_part2.q on Hadoop 17

2010-07-29 Thread Ning Zhang
Did you run ant clean clean-test after 20 and before 17? If not, some tests 
later than load_dyn_part2 could have changed srcpart.

On Jul 29, 2010, at 4:40 PM, John Sichi wrote:

> Yes, it passed for me just now when run in isolation.
>
> I wonder why it passed on the first full test run (on Hadoop 20), but then 
> failed when I re-ran against 17.
>
> JVS
Re: load_dyn_part2.q on Hadoop 17

2010-07-29 Thread John Sichi
Yes, it passed for me just now when run in isolation.

I wonder why it passed on the first full test run (on Hadoop 20), but then 
failed when I re-ran against 17.

JVS

On Jul 29, 2010, at 4:11 PM, Ning Zhang wrote:

> John, it works for me when running this test alone. One thing I noticed is 
> that the results you got have additional partitions (hr=13..19). This can be 
> explained by srcpart having been changed (additional partitions added by other 
> tests). Since we no longer clean up the srcpart tables for each .q file, the 
> side effects of previous .q files can remain. Can you check whether you get 
> the correct results by testing this .q file alone?

Re: load_dyn_part2.q on Hadoop 17

2010-07-29 Thread Ning Zhang
John, it works for me when running this test alone. One thing I noticed is that 
the results you got have additional partitions (hr=13..19). This can be 
explained by srcpart having been changed (additional partitions added by other 
tests). Since we no longer clean up the srcpart tables for each .q file, the 
side effects of previous .q files can remain. Can you check whether you get 
the correct results by testing this .q file alone?


On Jul 29, 2010, at 1:08 PM, John Sichi wrote:

> I just hit a test failure with this on latest trunk (while testing out a 
> patch); see diff output below.  Do you know if this broke recently?  Same 
> code passed on Hadoop 20.
>
> JVS
>
> 
>
> [jsi...@dev578 ~/open/commit-trunk] diff 
> ql/src/test/results/clientpositive/load_dyn_part2.q.out 
> build/ql/test/logs/clientpositive/load_dyn_part2.q.out
> --- ql/src/test/results/clientpositive/load_dyn_part2.q.out 2010-07-28 
> 23:16:54.0 -0700
> +++ build/ql/test/logs/clientpositive/load_dyn_part2.q.out  2010-07-29 
> 09:57:13.0 -0700
> @@ -16,7 +16,7 @@
> ds string
> hr string
>
> -Detailed Table Information Table(tableName:nzhang_part_bucket, 
> dbName:default, owner:jssarma, createTime:1279737530, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, 
> comment:null), FieldSchema(name:value, type:string, comment:null)], 
> location:file:/mnt/vol/devrs004.snc1/jssarma/projects/hive_trunk/build/ql/test/data/warehouse/nzhang_part_bucket,
>  inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:10, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[key], sortCols:[], 
> parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, 
> comment:null), FieldSchema(name:hr, type:string, comment:null)], 
> parameters:{transient_lastDdlTime=1279737530}, viewOriginalText:null, 
> viewExpandedText:null, tableType:MANAGED_TABLE)
> +Detailed Table Information Table(tableName:nzhang_part_bucket, 
> dbName:default, owner:jsichi, createTime:1280422615, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, 
> comment:null), FieldSchema(name:value, type:string, comment:null)], 
> location:pfile:/data/users/jsichi/open/commit-trunk/build/ql/test/data/warehouse/nzhang_part_bucket,
>  inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:10, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[key], sortCols:[], 
> parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, 
> comment:null), FieldSchema(name:hr, type:string, comment:null)], 
> parameters:{transient_lastDdlTime=1280422615}, viewOriginalText:null, 
> viewExpandedText:null, tableType:MANAGED_TABLE)
> PREHOOK: query: explain
> insert overwrite table nzhang_part_bucket partition (ds='2010-03-23', hr) 
> select key, value, hr from srcpart where ds is not null and hr is not null
> PREHOOK: type: QUERY
> @@ -104,34 +104,98 @@
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
> POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=10
> POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=11
> POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=12
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=13
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=14
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=15
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=16
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=17
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=18
> +POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=19
> +POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=10).key 
> SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
> +POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=10).value 
> SIMPLE [(srcpart)srcpart.FieldSchema(name:hr, type:string, comment:null), ]
> POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=11).key 
> SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
> POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=11).value 
> SIMPLE [(srcpart)srcpart.FieldSchema(name:hr, type:string, comment:null), ]
> POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=12).key 
> SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
> POSTHOOK: Lineage: nzhan

[jira] Commented: (HIVE-1422) skip counter update when RunningJob.getCounters() returns null

2010-07-29 Thread Joydeep Sen Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893852#action_12893852
 ] 

Joydeep Sen Sarma commented on HIVE-1422:
-

I looked at the Hadoop 0.20 source a bit. It looks like both getCounters() and 
getJob() can return null (in case the job cannot be found). On 0.20, completed 
jobs are looked up from the persistent store, so I think this is pretty hard to 
hit (and if it does happen, it seems like a Hadoop bug). But for 0.17 (and maybe 
other versions in between) we need to guard against these null returns.
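A minimal self-contained sketch of the kind of null guard being discussed here. The RunningJob and Counters interfaces below are stand-ins for the Hadoop types (only their nullability matters), and this updateCounters is illustrative, not Hive's actual ExecDriver/Operator code:

```java
// Stand-ins for the Hadoop types involved; only the nullability matters here.
public class CounterUpdateGuard {

    interface Counters { long getCounter(String name); }
    interface RunningJob { Counters getCounters(); }

    // Returns true when counters were actually updated, false when the
    // update was skipped because the job or its counters could not be found.
    public static boolean updateCounters(RunningJob rj) {
        Counters ctrs = (rj == null) ? null : rj.getCounters();
        if (ctrs == null) {
            // Job not found / counters unavailable: skip the update instead
            // of dereferencing null and dying with a NullPointerException.
            return false;
        }
        // ... walk the operator tree and copy counter values here ...
        return true;
    }
}
```

The design point is simply that a missing Counters object is treated as "no update this round" rather than a fatal error, which matches the "skip counter update" wording in the issue title.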

> skip counter update when RunningJob.getCounters() returns null
> --
>
> Key: HIVE-1422
> URL: https://issues.apache.org/jira/browse/HIVE-1422
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.6.0
>Reporter: John Sichi
>Assignee: Joydeep Sen Sarma
> Fix For: 0.7.0
>
> Attachments: HIVE-1422.1.patch
>
>
> Under heavy load circumstances on some Hadoop versions, we may get a NPE from 
> trying to dereference a null Counters object.  I don't have a unit test which 
> can reproduce it, but here's an example stack from a production cluster we 
> saw today:
> 10/06/21 13:01:10 ERROR exec.ExecDriver: Ended Job = job_201005200457_701060 
> with exception 'java.lang.NullPointerException(null)'
> java.lang.NullPointerException
> at org.apache.hadoop.hive.ql.exec.Operator.updateCounters(Operator.java:999)
> at 
> org.apache.hadoop.hive.ql.exec.ExecDriver.updateCounters(ExecDriver.java:503)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.progress(ExecDriver.java:390)
> at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:697)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:107)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:55)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Hive Web Interface Broken YET AGAIN!

2010-07-29 Thread Ashish Thusoo
Can you point to the JIRA that introduced this problem?

Ashish 

-Original Message-
From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Thursday, July 29, 2010 7:38 AM
To: 
Subject: Hive Web Interface Broken YET AGAIN!

All,

While the web interface is not as widely used as the cli, people do use it. Its 
init process has been broken three times that I can remember: once by the shims, 
once by adding version numbers to the jars, and now it is affected by the libjars.

[r...@etl02 ~]# hive --service hwi
Exception in thread "main" java.io.IOException: Error opening job jar: -libjars
at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.&lt;init&gt;(ZipFile.java:114)
at java.util.jar.JarFile.&lt;init&gt;(JarFile.java:133)
at java.util.jar.JarFile.&lt;init&gt;(JarFile.java:70)
at org.apache.hadoop.util.RunJar.main(RunJar.java:88)

I notice someone patched the cli to deal with this. There is no test coverage 
for the shell scripts.

But it seems like only some of the scripts were repaired:

bin/ext/cli.sh
bin/ext/lineage.sh
bin/ext/metastore.sh

I wonder why only half the scripts were repaired. In general, if something 
changes in hive or hadoop that causes the cli to break, we should fix it across 
the board. I feel like every time a release is coming up, I test drive the web 
interface only to find that a simple script problem stops it from running.

Edward


RE: [howldev] Initial thoughts on authorization in howl

2010-07-29 Thread Ashish Thusoo
Hi Pradeep,

I get from this note that the authorization you are talking about here is 
basically the management of the permissions on the hdfs directories 
corresponding to the tables and the partitions. From that angle this sounds 
good to me. There is a whole set of permissions/authorizations with regard to 
the metadata operations themselves, e.g. who should be able to run an alter 
table add column or describe table. I presume that would be beyond the scope of 
this change and would come in later? I am thinking more in terms of the 
permissions model that is supported in SQL using GRANT statements etc.

I also presume that by conf variables you mean the key value properties that 
Hive can store in the metadata and not the hive conf variables, right?

Ashish

-Original Message-
From: John Sichi [mailto:jsi...@facebook.com] 
Sent: Wednesday, July 28, 2010 2:22 PM
To: hive-dev@hadoop.apache.org
Subject: Fwd: [howldev] Initial thoughts on authorization in howl

Begin forwarded message:

From: Pradeep Kamath &lt;prade...@yahoo-inc.com&gt;
Date: July 27, 2010 4:38:42 PM PDT
To: &lt;howl...@yahoogroups.com&gt;
Subject: [howldev] Initial thoughts on authorization in howl
Reply-To: &lt;howl...@yahoogroups.com&gt;



The initial thoughts on authorization in howl are to model authorization (for 
DDL ops like create table/drop table/add partition etc) after hdfs permissions. 
To be able to do this, we would like to extend createTable() to add the ability 
to record a different group from the user's primary group and to record the 
complete unix permissions on the table directory. Also, we would like to have a 
way for partition directories to inherit permissions and group information 
based on the table directory. To keep the metastore backward compatible for use 
with hive, I propose having conf variables to achieve these objectives:
-  table.group.name - value will indicate the 
name of the unix group for the table directory. This will be used by 
createTable() to perform a chgrp to the value provided. This property will 
provide the user the ability to choose from one of the many unix groups he is 
part of to associate with the table.
-  table.permissions - value will be of the form rwxrwxrwx to indicate 
read-write-execute permissions on the table directory. This will be used by 
createTable() to perform a chmod to the value provided. This will let the user 
decide what permissions he wants on the table.
-  partitions.inherit.permissions - a value of true will indicate that 
partitions inherit the group name and permissions of the table level directory. 
This will be used by addPartition() to perform a chgrp and chmod to the values 
as on the table directory.

I favor conf properties over API changes since the complete authorization 
design for hive is not finalized yet. These properties can be 
deprecated/removed when that is in place. These properties would also be useful 
to some installation of vanilla hive since at least DFS level authorization can 
now be achieved by hive without the user having to manually perform chgrp and 
chmod operations on DFS.
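As a rough illustration of the proposed table.permissions property, here is a small parser (entirely hypothetical, not part of Hive or howl) that turns an rwxrwxrwx-style value into the octal mode a chmod-style call would take:

```java
public class PermissionParser {

    // Translate "rwxr-x---" style strings to an octal mode (e.g. 0750).
    // Throws IllegalArgumentException on malformed input.
    public static int toOctalMode(String perms) {
        if (perms == null || perms.length() != 9) {
            throw new IllegalArgumentException("expected 9 characters: " + perms);
        }
        final String bits = "rwxrwxrwx";
        int mode = 0;
        for (int i = 0; i < 9; i++) {
            char c = perms.charAt(i);
            if (c == bits.charAt(i)) {
                mode |= 1 << (8 - i);        // set the matching permission bit
            } else if (c != '-') {
                throw new IllegalArgumentException("bad character at index " + i);
            }
        }
        return mode;
    }
}
```

For example, "rwxr-x---" maps to 0750; createTable() could feed such a value to whatever chmod mechanism the metastore uses on the table directory.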

I would like to hear from hive developers/committers whether this would be 
acceptable for hive and also thoughts from others.

Pradeep






Hudson build is back to normal : Hive-trunk-h0.20 #331

2010-07-29 Thread Apache Hudson Server
See 




load_dyn_part2.q on Hadoop 17

2010-07-29 Thread John Sichi
I just hit a test failure with this on latest trunk (while testing out a 
patch); see diff output below.  Do you know if this broke recently?  Same code 
passed on Hadoop 20.

JVS



[jsi...@dev578 ~/open/commit-trunk] diff 
ql/src/test/results/clientpositive/load_dyn_part2.q.out 
build/ql/test/logs/clientpositive/load_dyn_part2.q.out
--- ql/src/test/results/clientpositive/load_dyn_part2.q.out 2010-07-28 
23:16:54.0 -0700
+++ build/ql/test/logs/clientpositive/load_dyn_part2.q.out  2010-07-29 
09:57:13.0 -0700
@@ -16,7 +16,7 @@
 ds string
 hr string

-Detailed Table Information Table(tableName:nzhang_part_bucket, 
dbName:default, owner:jssarma, createTime:1279737530, lastAccessTime:0, 
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, 
comment:null), FieldSchema(name:value, type:string, comment:null)], 
location:file:/mnt/vol/devrs004.snc1/jssarma/projects/hive_trunk/build/ql/test/data/warehouse/nzhang_part_bucket,
 inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:10, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=1}), bucketCols:[key], sortCols:[], 
parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, comment:null), 
FieldSchema(name:hr, type:string, comment:null)], 
parameters:{transient_lastDdlTime=1279737530}, viewOriginalText:null, 
viewExpandedText:null, tableType:MANAGED_TABLE)
+Detailed Table Information Table(tableName:nzhang_part_bucket, 
dbName:default, owner:jsichi, createTime:1280422615, lastAccessTime:0, 
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, 
comment:null), FieldSchema(name:value, type:string, comment:null)], 
location:pfile:/data/users/jsichi/open/commit-trunk/build/ql/test/data/warehouse/nzhang_part_bucket,
 inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
compressed:false, numBuckets:10, serdeInfo:SerDeInfo(name:null, 
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
parameters:{serialization.format=1}), bucketCols:[key], sortCols:[], 
parameters:{}), partitionKeys:[FieldSchema(name:ds, type:string, comment:null), 
FieldSchema(name:hr, type:string, comment:null)], 
parameters:{transient_lastDdlTime=1280422615}, viewOriginalText:null, 
viewExpandedText:null, tableType:MANAGED_TABLE)
 PREHOOK: query: explain
 insert overwrite table nzhang_part_bucket partition (ds='2010-03-23', hr) 
select key, value, hr from srcpart where ds is not null and hr is not null
 PREHOOK: type: QUERY
@@ -104,34 +104,98 @@
 POSTHOOK: Input: defa...@srcpart@ds=2008-04-08/hr=12
 POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=11
 POSTHOOK: Input: defa...@srcpart@ds=2008-04-09/hr=12
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=10
 POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=11
 POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=12
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=13
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=14
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=15
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=16
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=17
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=18
+POSTHOOK: Output: defa...@nzhang_part_bucket@ds=2010-03-23/hr=19
+POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=10).key 
SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
+POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=10).value 
SIMPLE [(srcpart)srcpart.FieldSchema(name:hr, type:string, comment:null), ]
 POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=11).key 
SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
 POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=11).value 
SIMPLE [(srcpart)srcpart.FieldSchema(name:hr, type:string, comment:null), ]
 POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=12).key 
SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
 POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=12).value 
SIMPLE [(srcpart)srcpart.FieldSchema(name:hr, type:string, comment:null), ]
+POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=13).key 
SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
+POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=13).value 
SIMPLE [(srcpart)srcpart.FieldSchema(name:hr, type:string, comment:null), ]
+POSTHOOK: Lineage: nzhang_part_bucket PARTITION(ds=2010-03-23,hr=14).key 
SIMPLE [(srcpart)srcpart.FieldSchema(name:ds, type:string, comment:null), ]
+POSTHOOK: Lineage: nzhang_part_bucket

FW: Announcing Howl development list

2010-07-29 Thread Pradeep Kamath
FYI - howl is a project to create a shared metadata system between Pig,
Hive, and Map Reduce based on hive metastore - below is a mail from Alan
Gates on the pig dev list about it with pointers to a wiki and a mailing
list.

Pradeep

-Original Message-
From: Alan Gates [mailto:ga...@yahoo-inc.com] 
Sent: Tuesday, July 20, 2010 10:04 AM
To: pig-...@hadoop.apache.org
Cc: Carl Steinbach; Dmitriy Ryaboy
Subject: Announcing Howl development list

A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl

A howldev mailing list has been set up on Yahoo! groups for  
discussions on Howl.  You can subscribe by sending mail to
howldev-subscr...@yahoogroups.com 
.  We plan on putting the code on github in a read only repository.   
It will be a few more days before we get there.  It will be announced  
on the list when it is.

Alan.



RE: Hive should start moving to the new hadoop mapreduce api.

2010-07-29 Thread Ashish Thusoo
Before deciding that, we should poll the user list to see if this would be too 
disruptive for anyone.

Ashish 

-Original Message-
From: Ning Zhang [mailto:nzh...@facebook.com] 
Sent: Thursday, July 29, 2010 12:18 PM
To: 
Subject: Re: Hive should start moving to the new hadoop mapreduce api.

Maybe we should make hive-0.7 the last branch to support the pre-0.20 Hadoop 
API, and switch later branches of Hive to the new Hadoop API?

On Jul 29, 2010, at 11:53 AM, Ashish Thusoo wrote:

> Yes these are mutually exclusive.
> 
> Ashish
> 
> -Original Message-
> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> Sent: Thursday, July 29, 2010 11:20 AM
> To: hive-dev@hadoop.apache.org
> Subject: Re: Hive should start moving to the new hadoop mapreduce api.
> 
> Aren't these things mutually exclusive?
> The new Map Reduce API appeared in 20.
> Deprecating 17 seems reasonable, but we still have to support the old api for 
> 18 and 19 correct?
> 
> On Thu, Jul 29, 2010 at 2:11 PM, Ashish Thusoo  wrote:
>> +1 to this
>> 
>> Ashish
>> 
>> -Original Message-
>> From: yongqiang he [mailto:heyongqiang...@gmail.com]
>> Sent: Thursday, July 29, 2010 10:54 AM
>> To: hive-dev@hadoop.apache.org
>> Subject: Hive should start moving to the new hadoop mapreduce api.
>> 
>> Hi all,
>> 
>> In offline discussions while fixing HIVE-1492, we thought it may be a good 
>> time to start moving Hive to the new MapReduce context API, and also to 
>> start deprecating Hadoop 0.17.0 support in Hive.
>> Basically, the new MapReduce API gives Hive more control at runtime.
>> 
>> Any thoughts on this?
>> 
>> 
>> Thanks
>> 
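To make the "more control at runtime" point concrete: under the old mapred API the framework owns the record-reading loop, while the new mapreduce API lets the Mapper's run(Context) own it. The sketch below uses stand-in types (not the real Hadoop classes) to show why a RecordReader failure becomes visible to the application layer under the new style:

```java
public class NewApiSketch {

    // Stand-in for org.apache.hadoop.mapreduce.Mapper.Context: the record
    // iterator that the *new* API hands to the Mapper itself.
    interface Context {
        boolean nextKeyValue() throws java.io.IOException; // may fail in the RecordReader layer
        String getCurrentValue();
    }

    // New-API style: run() owns the read loop, so a RecordReader failure is
    // caught here, before any close()/cleanup() logic runs. Under the old
    // mapred API this loop lives inside the framework, and an exception
    // thrown there never reaches the Mapper at all.
    public static boolean runMapper(Context ctx) {
        try {
            while (ctx.nextKeyValue()) {
                // map(key, ctx.getCurrentValue(), ctx) would go here
            }
            return true;   // clean end of input
        } catch (java.io.IOException e) {
            return false;  // reader failure is now visible to the application
        }
    }
}
```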



Build failed in Hudson: Hive-trunk-h0.19 #508

2010-07-29 Thread Apache Hudson Server
See 

Changes:

[heyongqiang] HIVE-1492. FileSinkOperator should remove duplicated files from 
the same task based on file sizes.(Ning Zhang via He Yongqiang)

[nzhang] HIVE-1408. add option to let hive automatically run in local mode 
based on tunable heuristics (Joydeep Sen Sarma via Ning Zhang)

[jvs] HIVE-1481:  whoops, svn add data/files/text-en.txt

[jvs] HIVE-1481. ngrams() UDAF for estimating top-k n-gram frequencies
(Mayank Lahiri via jvs)

--
[...truncated 13348 lines...]
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 

[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Copying data from 

[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Copying data from 

[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Copying data from 

[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Copying data from 

[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff 

 

[junit] Done query: unknown_table1.q
[junit] Begin query: unknown_table2.q
[junit] Copying data from 

[junit] Loading data to table srcpart partition (ds=2008-04-08, hr=11)
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcpart partition (ds=2008-04-08, hr=12)
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcpart partition (ds=2008-04-09, hr=11)
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcpart partition (ds=2008-04-09, hr=12)
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 

[junit] Loading data to table srcbucket2
[junit] POSTHOOK: Output: defa...@srcbucket2
[junit] OK
[junit] Copying data from 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893788#action_12893788
 ] 

Ning Zhang commented on HIVE-1492:
--

@Edward, this is a heuristic that should be generally true. The good news is 
that we are not aware of any exceptions that violate the rule (assuming 
multiple attempts of the same task give deterministic results).

The reason we are relying on a heuristic here is that the old Hadoop API does 
not support exception handling outside the Mapper's map() function. The bug 
shows up when an exception is thrown by Hadoop's RecordReader layer and the 
message is not passed to the Mapper. When mapper.close() is called, there is no 
way for the mapper to know whether an exception happened in the Hadoop code 
path. A better way to handle this is to use the new Hadoop API, which gives 
more control to the application layer. This heuristic is a workaround based on 
the old Hadoop API.
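A self-contained sketch of the size-based heuristic. The FileInfo class and method names are illustrative stand-ins, not Hive's actual Utilities.removeTempOrDuplicateFiles code: group files by task id and keep only the largest file per task.

```java
import java.util.*;

public class DuplicatePruner {

    // Minimal stand-in for a FileStatus: file name plus its length in bytes.
    public static final class FileInfo {
        final String name;
        final long length;
        public FileInfo(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Derive the task id from a file name by stripping the trailing
    // attempt suffix, e.g. "000003_1" -> "000003".
    static String taskId(String fileName) {
        int idx = fileName.lastIndexOf('_');
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    // Keep the largest file per task id; every other file for that task is
    // treated as a duplicate left behind by a failed or speculative attempt.
    public static List<String> filesToDelete(List<FileInfo> files) {
        Map<String, FileInfo> largest = new HashMap<>();
        List<String> toDelete = new ArrayList<>();
        for (FileInfo f : files) {
            String id = taskId(f.name);
            FileInfo prev = largest.get(id);
            if (prev == null) {
                largest.put(id, f);          // first file seen for this task
            } else if (f.length > prev.length) {
                toDelete.add(prev.name);     // newcomer is bigger: drop the old one
                largest.put(id, f);
            } else {
                toDelete.add(f.name);        // newcomer is smaller or equal: drop it
            }
        }
        return toDelete;
    }
}
```

Because attempts of the same task are assumed deterministic, a strictly shorter duplicate can only be a partially written file from a failed or speculative attempt, which is why keeping the largest is safe under that assumption.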


> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893786#action_12893786
 ] 

He Yongqiang commented on HIVE-1492:


running test on branch-0.6

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread Siying Dong
Larger files are not guaranteed to be the right ones. (For example, there could 
be user-defined transform scripts that freely access external resources and 
generate output we have no control over.) But the larger file, rather than the 
first one, is much more likely to be the correct one. Until we use the new 
MapReduce API to fix the underlying issue of generating wrong results in 
MapReduce, this patch will fix the problem in most scenarios.

-Original Message-
From: He Yongqiang (JIRA) [mailto:j...@apache.org] 
Sent: Thursday, July 29, 2010 12:12 PM
To: hive-dev@hadoop.apache.org
Subject: [jira] Commented: (HIVE-1492) FileSinkOperator should remove 
duplicated files from the same task based on file sizes


[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782
 ] 

He Yongqiang commented on HIVE-1492:


The assumption of MapReduce is: given the same input and the same map/reduce 
function, the output should always be the same.

Otherwise the MapReduce fault-tolerance mechanism would be wrong.

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Hive should start moving to the new hadoop mapreduce api.

2010-07-29 Thread Ning Zhang
Maybe we should make hive-0.7 the last branch to support the pre-0.20 Hadoop 
API, and switch later branches of Hive to the new Hadoop API?

On Jul 29, 2010, at 11:53 AM, Ashish Thusoo wrote:

> Yes these are mutually exclusive.
> 
> Ashish 
> 
> -Original Message-
> From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
> Sent: Thursday, July 29, 2010 11:20 AM
> To: hive-dev@hadoop.apache.org
> Subject: Re: Hive should start moving to the new hadoop mapreduce api.
> 
> Aren't these things mutually exclusive?
> The new Map Reduce API appeared in 20.
> Deprecating 17 seems reasonable, but we still have to support the old api for 
> 18 and 19 correct?
> 
> On Thu, Jul 29, 2010 at 2:11 PM, Ashish Thusoo  wrote:
>> +1 to this
>> 
>> Ashish
>> 
>> -Original Message-
>> From: yongqiang he [mailto:heyongqiang...@gmail.com]
>> Sent: Thursday, July 29, 2010 10:54 AM
>> To: hive-dev@hadoop.apache.org
>> Subject: Hive should start moving to the new hadoop mapreduce api.
>> 
>> Hi all,
>> 
>> In offline discussions while fixing HIVE-1492, we thought it may be a good 
>> time to start moving Hive to the new MapReduce context API, and also to 
>> start deprecating Hadoop 0.17.0 support in Hive.
>> Basically, the new MapReduce API gives Hive more control at runtime.
>> 
>> Any thoughts on this?
>> 
>> 
>> Thanks
>> 



[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782
 ] 

He Yongqiang commented on HIVE-1492:


The assumption of MapReduce is: given the same input and the same map/reduce 
function, the output should always be the same.

Otherwise the MapReduce fault-tolerance mechanism would be wrong.

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893772#action_12893772
 ] 

Edward Capriolo commented on HIVE-1492:
---

"the largest file is the correct file" 
Is that generally true or an absolute fact?

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HIVE-1471) CTAS should unescape the column name in the select-clause.

2010-07-29 Thread Ning Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Zhang updated HIVE-1471:
-

Status: Resolved  (was: Patch Available)
Resolution: Fixed

committed by Joydeep already. Closing.

> CTAS should unescape the column name in the select-clause. 
> ---
>
> Key: HIVE-1471
> URL: https://issues.apache.org/jira/browse/HIVE-1471
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1471.patch
>
>
> The following query 
> {{{
> create table T as select `to` from S;
> }}}
> fails because `to` should be unescaped before creating the table. 
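The unescaping the report asks for amounts to stripping the surrounding backticks before the name is reused as a column of the created table. A minimal standalone sketch; the method name and exact behavior are assumptions, not necessarily Hive's real helper:

```java
public class UnescapeSketch {
    // Strip one pair of surrounding backticks, if present (assumed behavior).
    static String unescapeIdentifier(String id) {
        if (id != null && id.length() > 1 && id.startsWith("`") && id.endsWith("`")) {
            return id.substring(1, id.length() - 1);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(unescapeIdentifier("`to`"));  // prints to
        System.out.println(unescapeIdentifier("col1"));  // prints col1
    }
}
```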




RE: Hive should start moving to the new hadoop mapreduce api.

2010-07-29 Thread Ashish Thusoo
Yes, these are mutually exclusive.

Ashish 

-Original Message-
From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Thursday, July 29, 2010 11:20 AM
To: hive-dev@hadoop.apache.org
Subject: Re: Hive should start moving to the new hadoop mapreduce api.

Aren't these things mutually exclusive?
The new MapReduce API appeared in 0.20.
Deprecating 0.17 seems reasonable, but we still have to support the old API
for 0.18 and 0.19, correct?

On Thu, Jul 29, 2010 at 2:11 PM, Ashish Thusoo  wrote:
> +1 to this
>
> Ashish
>
> -Original Message-
> From: yongqiang he [mailto:heyongqiang...@gmail.com]
> Sent: Thursday, July 29, 2010 10:54 AM
> To: hive-dev@hadoop.apache.org
> Subject: Hive should start moving to the new hadoop mapreduce api.
>
> Hi all,
>
> In offline discussions while fixing HIVE-1492, we thought it may be a good
> time to start moving Hive to the new MapReduce context API, and also to
> start deprecating Hadoop 0.17.0 support in Hive.
> Basically the new MapReduce API gives Hive more control at runtime.
>
> Any thoughts on this?
>
>
> Thanks
>


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread Ning Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ning Zhang updated HIVE-1492:
-

Attachment: HIVE-1492_branch-0.6.patch

Uploading a patch for branch-0.6.

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 




Re: Hive should start moving to the new hadoop mapreduce api.

2010-07-29 Thread Edward Capriolo
Aren't these things mutually exclusive?
The new MapReduce API appeared in 0.20.
Deprecating 0.17 seems reasonable, but we still have to support the old API
for 0.18 and 0.19, correct?

On Thu, Jul 29, 2010 at 2:11 PM, Ashish Thusoo  wrote:
> +1 to this
>
> Ashish
>
> -Original Message-
> From: yongqiang he [mailto:heyongqiang...@gmail.com]
> Sent: Thursday, July 29, 2010 10:54 AM
> To: hive-dev@hadoop.apache.org
> Subject: Hive should start moving to the new hadoop mapreduce api.
>
> Hi all,
>
> In offline discussions while fixing HIVE-1492, we thought it may be a good
> time to start moving Hive to the new MapReduce context API, and also to
> start deprecating Hadoop 0.17.0 support in Hive.
> Basically the new MapReduce API gives Hive more control at runtime.
>
> Any thoughts on this?
>
>
> Thanks
>


RE: Hive should start moving to the new hadoop mapreduce api.

2010-07-29 Thread Ashish Thusoo
+1 to this

Ashish

-Original Message-
From: yongqiang he [mailto:heyongqiang...@gmail.com] 
Sent: Thursday, July 29, 2010 10:54 AM
To: hive-dev@hadoop.apache.org
Subject: Hive should start moving to the new hadoop mapreduce api.

Hi all,

In offline discussions while fixing HIVE-1492, we thought it may be a good
time to start moving Hive to the new MapReduce context API, and also to
start deprecating Hadoop 0.17.0 support in Hive.
Basically the new MapReduce API gives Hive more control at runtime.

Any thoughts on this?


Thanks


Hive should start moving to the new hadoop mapreduce api.

2010-07-29 Thread yongqiang he
Hi all,

In offline discussions while fixing HIVE-1492, we thought it may be a good
time to start moving Hive to the new MapReduce context API, and also to
start deprecating Hadoop 0.17.0 support in Hive.
Basically the new MapReduce API gives Hive more control at runtime.
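For readers unfamiliar with the difference, here is a self-contained illustration of the two styles. This is not Hadoop code; the types are hypothetical stand-ins for the 0.20 API shape, where a single Context object carries output plus runtime hooks that the old API spread across separate OutputCollector and Reporter arguments:

```java
import java.util.*;

public class ApiStyles {
    // Old style: map() receives only a plain output sink (stand-in for
    // org.apache.hadoop.mapred.OutputCollector).
    interface OutputCollector<K, V> { void collect(K key, V value); }

    // New style: one Context carries output plus runtime hooks (stand-in
    // for the org.apache.hadoop.mapreduce Context classes).
    static class Context {
        final List<String> written = new ArrayList<>();
        String status = "";
        void write(String key, long value) { written.add(key + "=" + value); }
        void setStatus(String s) { status = s; }
    }

    static void newStyleMap(String value, Context ctx) {
        ctx.write(value, 1);
        ctx.setStatus("processed " + value);
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        newStyleMap("hello", ctx);
        System.out.println(ctx.written + " | " + ctx.status); // prints [hello=1] | processed hello
    }
}
```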

Any thoughts on this?


Thanks


[ANNOUNCE] Next HUG meetup: Noida/NCR- India - 31st July 2010 : Reminder

2010-07-29 Thread Sanjay Sharma
Hi All,

We are planning to hold the next Hadoop India User Group meet up on 31st July 
2010 in Noida, India.

The registration and event details are available at - 
http://hugindia-absolutezeroforum.eventbrite.com/



We currently have the following talks lined up-

-  HIHO - by Sonal Goyal, Meghsoft

-  JAQL- by Himanshu Gupta, IBM

-  Visual HIVE - by Sajal Rastogi, Intellicus



Feedback/suggestions can be provided here.



The event is being hosted by Absolute Zero Forum and sponsored by Impetus.

We should be able to accommodate around 50-60 friendly people.



Regards,

Sanjay Sharma
www.impetus.com






[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

2010-07-29 Thread He Yongqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Yongqiang updated HIVE-1492:
---

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.7.0
   Resolution: Fixed

I just committed. Thanks  Ning!

> FileSinkOperator should remove duplicated files from the same task based on 
> file sizes
> --
>
> Key: HIVE-1492
> URL: https://issues.apache.org/jira/browse/HIVE-1492
> Project: Hadoop Hive
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ning Zhang
>Assignee: Ning Zhang
> Fix For: 0.7.0
>
> Attachments: HIVE-1492.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to 
> retain only one file for each task. A task could produce multiple files due 
> to failed attempts or speculative runs. The largest file should be retained 
> rather than the first file for each task. 




[jira] Updated: (HIVE-1294) HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface

2010-07-29 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1294:
--

   Status: Patch Available  (was: Open)
Fix Version/s: 0.6.0

Hwi does not start correctly without this patch.

> HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface
> 
>
> Key: HIVE-1294
> URL: https://issues.apache.org/jira/browse/HIVE-1294
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.5.0
>Reporter: Dilip Joseph
>Assignee: Edward Capriolo
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: hive-1294.patch.txt
>
>
> The Hive web server fails to start up with the following error message if the 
> HIVE_AUX_JARS_PATH environment variable is set (it works fine if unset).
> $ build/dist/bin/hive --service hwi
> Exception in thread "main" java.io.IOException: Error opening job jar: 
> -libjars
>at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
> Caused by: java.util.zip.ZipException: error in opening zip file
>at java.util.zip.ZipFile.open(Native Method)
>at java.util.zip.ZipFile.<init>(ZipFile.java:114)
>at java.util.jar.JarFile.<init>(JarFile.java:133)
>at java.util.jar.JarFile.<init>(JarFile.java:70)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
> Slightly modifying the command line to launch hadoop in hwi.sh solves the 
> problem:
> $ diff bin/ext/hwi.sh  /tmp/new-hwi.sh
> 28c28
> <   exec $HADOOP jar $AUX_JARS_CMD_LINE ${HWI_JAR_FILE} $CLASS $HIVE_OPTS "$@"
> ---
> >   exec $HADOOP jar ${HWI_JAR_FILE} $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"




[jira] Updated: (HIVE-1294) HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface

2010-07-29 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1294:
--

Attachment: hive-1294.patch.txt

> HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface
> 
>
> Key: HIVE-1294
> URL: https://issues.apache.org/jira/browse/HIVE-1294
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.5.0
>Reporter: Dilip Joseph
>Assignee: Edward Capriolo
>Priority: Blocker
> Attachments: hive-1294.patch.txt
>
>
> The Hive web server fails to start up with the following error message if the 
> HIVE_AUX_JARS_PATH environment variable is set (it works fine if unset).
> $ build/dist/bin/hive --service hwi
> Exception in thread "main" java.io.IOException: Error opening job jar: 
> -libjars
>at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
> Caused by: java.util.zip.ZipException: error in opening zip file
>at java.util.zip.ZipFile.open(Native Method)
>at java.util.zip.ZipFile.<init>(ZipFile.java:114)
>at java.util.jar.JarFile.<init>(JarFile.java:133)
>at java.util.jar.JarFile.<init>(JarFile.java:70)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
> Slightly modifying the command line to launch hadoop in hwi.sh solves the 
> problem:
> $ diff bin/ext/hwi.sh  /tmp/new-hwi.sh
> 28c28
> <   exec $HADOOP jar $AUX_JARS_CMD_LINE ${HWI_JAR_FILE} $CLASS $HIVE_OPTS "$@"
> ---
> >   exec $HADOOP jar ${HWI_JAR_FILE} $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"




[jira] Assigned: (HIVE-1294) HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface

2010-07-29 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo reassigned HIVE-1294:
-

Assignee: Edward Capriolo

> HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface
> 
>
> Key: HIVE-1294
> URL: https://issues.apache.org/jira/browse/HIVE-1294
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.5.0
>Reporter: Dilip Joseph
>Assignee: Edward Capriolo
>Priority: Minor
>
> The Hive web server fails to start up with the following error message if the 
> HIVE_AUX_JARS_PATH environment variable is set (it works fine if unset).
> $ build/dist/bin/hive --service hwi
> Exception in thread "main" java.io.IOException: Error opening job jar: 
> -libjars
>at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
> Caused by: java.util.zip.ZipException: error in opening zip file
>at java.util.zip.ZipFile.open(Native Method)
>at java.util.zip.ZipFile.<init>(ZipFile.java:114)
>at java.util.jar.JarFile.<init>(JarFile.java:133)
>at java.util.jar.JarFile.<init>(JarFile.java:70)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
> Slightly modifying the command line to launch hadoop in hwi.sh solves the 
> problem:
> $ diff bin/ext/hwi.sh  /tmp/new-hwi.sh
> 28c28
> <   exec $HADOOP jar $AUX_JARS_CMD_LINE ${HWI_JAR_FILE} $CLASS $HIVE_OPTS "$@"
> ---
> >   exec $HADOOP jar ${HWI_JAR_FILE} $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"




[jira] Updated: (HIVE-1294) HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface

2010-07-29 Thread Edward Capriolo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Capriolo updated HIVE-1294:
--

Priority: Blocker  (was: Minor)

> HIVE_AUX_JARS_PATH interferes with startup of Hive Web Interface
> 
>
> Key: HIVE-1294
> URL: https://issues.apache.org/jira/browse/HIVE-1294
> Project: Hadoop Hive
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 0.5.0
>Reporter: Dilip Joseph
>Assignee: Edward Capriolo
>Priority: Blocker
>
> The Hive web server fails to start up with the following error message if the 
> HIVE_AUX_JARS_PATH environment variable is set (it works fine if unset).
> $ build/dist/bin/hive --service hwi
> Exception in thread "main" java.io.IOException: Error opening job jar: 
> -libjars
>at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
> Caused by: java.util.zip.ZipException: error in opening zip file
>at java.util.zip.ZipFile.open(Native Method)
>at java.util.zip.ZipFile.<init>(ZipFile.java:114)
>at java.util.jar.JarFile.<init>(JarFile.java:133)
>at java.util.jar.JarFile.<init>(JarFile.java:70)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
> Slightly modifying the command line to launch hadoop in hwi.sh solves the 
> problem:
> $ diff bin/ext/hwi.sh  /tmp/new-hwi.sh
> 28c28
> <   exec $HADOOP jar $AUX_JARS_CMD_LINE ${HWI_JAR_FILE} $CLASS $HIVE_OPTS "$@"
> ---
> >   exec $HADOOP jar ${HWI_JAR_FILE} $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"




Hive Web Interface Broken YET AGAIN!

2010-07-29 Thread Edward Capriolo
All,

While the web interface is not as widely used as the CLI, people do
use it. Its init process has been broken three times that I can remember:
once by the shims, once by adding version numbers to the jars, and now
it is affected by the -libjars change.

[r...@etl02 ~]# hive --service hwi
Exception in thread "main" java.io.IOException: Error opening job jar: -libjars
at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:114)
at java.util.jar.JarFile.<init>(JarFile.java:133)
at java.util.jar.JarFile.<init>(JarFile.java:70)
at org.apache.hadoop.util.RunJar.main(RunJar.java:88)

I notice someone patched the CLI to deal with this. There is no test
coverage for the shell scripts.

But it seems like only some of the scripts were repaired:

bin/ext/cli.sh
bin/ext/lineage.sh
bin/ext/metastore.sh

I wonder why only half of the scripts were repaired. In general, if
something changes in Hive or Hadoop that causes the CLI to break, we
should fix it across the board. I feel like every time a release is
coming up, I test-drive the web interface only to find that a simple
script problem stops it from running.

Edward


[jira] Created: (HIVE-1493) incorrect explanation when local mode not chosen automatically

2010-07-29 Thread Joydeep Sen Sarma (JIRA)
incorrect explanation when local mode not chosen automatically
--

 Key: HIVE-1493
 URL: https://issues.apache.org/jira/browse/HIVE-1493
 Project: Hadoop Hive
  Issue Type: Bug
  Components: Query Processor
Reporter: Joydeep Sen Sarma
Priority: Minor


slipped past in HIVE-1408:

// check for max input size
if (inputSummary.getLength() > maxBytes)
  return "Input Size (= " + maxBytes + ") is larger than " +
      HiveConf.ConfVars.LOCALMODEMAXBYTES.varname + " (= " + maxBytes + ")";


printing same value twice.
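A standalone sketch of the presumed fix — reporting the actual input size in the first placeholder instead of repeating the limit. The configuration variable name is passed in as a parameter here purely for illustration:

```java
public class LocalModeMessage {
    // Build the explanation for why local mode was not chosen; null means
    // the input is small enough. The first "(= ...)" reports the real input
    // size rather than printing maxBytes twice (the bug described above).
    static String tooLargeMessage(long inputLength, long maxBytes, String confVarName) {
        if (inputLength > maxBytes) {
            return "Input Size (= " + inputLength + ") is larger than " +
                   confVarName + " (= " + maxBytes + ")";
        }
        return null;
    }

    public static void main(String[] args) {
        // prints: Input Size (= 2048) is larger than hive.exec.mode.local.auto.inputbytes.max (= 1024)
        System.out.println(tooLargeMessage(2048, 1024,
            "hive.exec.mode.local.auto.inputbytes.max"));
    }
}
```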
