[jira] [Created] (HIVE-10022) DFS in authorization might take too long

2015-03-19 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-10022:


 Summary: DFS in authorization might take too long
 Key: HIVE-10022
 URL: https://issues.apache.org/jira/browse/HIVE-10022
 Project: Hive
  Issue Type: Bug
  Components: Authorization
Affects Versions: 0.14.0
Reporter: Pankit Thapar


I am testing a query like this:

set hive.test.authz.sstd.hs2.mode=true;
set 
hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactoryForTest;
set 
hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateConfigUserAuthenticator;
set hive.security.authorization.enabled=true;
set user.name=user1;
create table auth_noupd(i int) clustered by (i) into 2 buckets stored as orc 
location '${OUTPUT}' TBLPROPERTIES ('transactional'='true');

Now, in the above query, since authorization is enabled, we end up calling 
doAuthorizationV2(), which ultimately calls 
SQLAuthorizationUtils.getPrivilegesFromFS(), which in turn calls a recursive 
method, FileUtils.isActionPermittedForFileHierarchy(), with the object we are 
trying to authorize, or with an ancestor of that object if the object does not 
exist.

FileUtils.isActionPermittedForFileHierarchy() walks the tree depth-first (DFS).

Now assume we are trying to authorize a path a/b/c/d.
If a/b/c/d does not exist, we call 
FileUtils.isActionPermittedForFileHierarchy() with, say, a/b/ (assuming a/b/c 
also does not exist).
If the subtree under a/b contains millions of files, then 
FileUtils.isActionPermittedForFileHierarchy() is going to check the file 
permission on every one of those objects.

I do not completely understand why we have to check file permissions on all 
the objects in branches of the tree that we are not trying to read from or 
write to.
We could instead check the file permission on the ancestor that does exist 
and, if it matches what we expect, return true.
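
For illustration, here is a minimal sketch of that suggestion: walk up to the 
deepest existing ancestor and check the requested action only there. The class 
name, helper signature, and the simplified owner/other permission check are 
assumptions for the sake of the example, not Hive's actual implementation.

// Hypothetical sketch of the proposed short-circuit: instead of recursing
// over the whole subtree, check the action only on the deepest ancestor
// that actually exists.
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public final class AncestorPermissionSketch {
  public static boolean isActionPermittedOnExistingAncestor(
      FileSystem fs, Path path, String user, FsAction action) throws IOException {
    Path current = path;
    // Walk up until we find a path that exists; the filesystem root always exists.
    while (current != null && !fs.exists(current)) {
      current = current.getParent();
    }
    if (current == null) {
      return false;
    }
    FileStatus status = fs.getFileStatus(current);
    // Simplified check: real code would also consult group membership and
    // superuser status before falling back to the "other" bits.
    FsAction allowed = user.equals(status.getOwner())
        ? status.getPermission().getUserAction()
        : status.getPermission().getOtherAction();
    return allowed.implies(action);
  }
}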

Please confirm whether this is a bug so that I can submit a patch, or else let 
me know what I am missing.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-12-11 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243037#comment-14243037
 ] 

Pankit Thapar commented on HIVE-8955:
-

[~ashutoshc], do you have any insight on this?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-11-24 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-8955:
---

 Summary: alter partition should check for hive.stats.autogather 
in hiveConf
 Key: HIVE-8955
 URL: https://issues.apache.org/jira/browse/HIVE-8955
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.15.0


When the alter partition code path is triggered, it should check the flag 
hive.stats.autogather; only if it is true should it call updateStats, else skip it.
This is already done in the append_partition code flow.
Is there any specific reason alter_partition does not respect this conf 
variable?

//code snippet : HiveMetastore.java
private Partition append_partition_common(RawStore ms, String dbName, String tableName,
    List<String> part_vals, EnvironmentContext envContext) throws InvalidObjectException,
    AlreadyExistsException, MetaException {
  ...
  if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVESTATSAUTOGATHER)
      && !MetaStoreUtils.isView(tbl)) {
    MetaStoreUtils.updatePartitionStatsFast(part, wh, madeDir);
  }
  ...
  ...
}

The above code snippet checks for the variable, but this same check is absent in

//code snippet : HiveAlterHandler.java
public Partition alterPartition(final RawStore msdb, Warehouse wh, final String dbname,
    final String name, final List<String> part_vals, final Partition new_part)
    throws InvalidOperationException, InvalidObjectException, AlreadyExistsException,
    MetaException {
  ...
}
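
For illustration, a minimal sketch of the guard being asked for, mirroring the 
append_partition_common check above; the surrounding variable names (hiveConf, 
tbl, new_part, wh, madeDir) are taken from the snippets and this is not an 
actual patch:

// Hypothetical addition inside HiveAlterHandler.alterPartition(): only update
// partition stats when hive.stats.autogather is enabled and the table is not a view.
if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVESTATSAUTOGATHER)
    && !MetaStoreUtils.isView(tbl)) {
  MetaStoreUtils.updatePartitionStatsFast(new_part, wh, madeDir);
}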




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-11-24 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223568#comment-14223568
 ] 

Pankit Thapar commented on HIVE-8955:
-

[~szehon], can you please confirm whether this is a bug or intentional?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-11-24 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223712#comment-14223712
 ] 

Pankit Thapar commented on HIVE-8955:
-

Thanks for the quick glance [~szehon]. As far as I can tell, it's the same flag, 
hive.stats.autogather:

// Statistics
HIVESTATSAUTOGATHER("hive.stats.autogather", true,
    "A flag to gather statistics automatically during the INSERT OVERWRITE command."),

But this flag is not used in the code flow for alter_partition.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-11-24 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223778#comment-14223778
 ] 

Pankit Thapar commented on HIVE-8955:
-

Yes, you are correct that the stats are updated on insert overwrite, but insert 
overwrite might itself call append_partition or alter_partition. In the append 
case it respects the flag, but not in the alter partition case.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-11-24 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223782#comment-14223782
 ] 

Pankit Thapar commented on HIVE-8955:
-

[~ashutoshc], can you please confirm whether this is a bug or not?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf

2014-11-24 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223797#comment-14223797
 ] 

Pankit Thapar commented on HIVE-8955:
-

If I insert overwrite into an already existing partition, I see that it does 
the stats update even when hive.stats.autogather is set to false.
For example:
 [hadoop@ip-10-169-146-156 ~]$ hive --hiveconf hive.log.dir=. --hiveconf hive.stats.autogather=false
hive> create table test(x string, y string, z string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH './file.txt' OVERWRITE INTO TABLE test;
hive> create table test_part(a string) PARTITIONED BY (x string, y string) LOCATION 'my table location';
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE test_part PARTITION (x,y) select x,y,z from test;

I see the stats being updated for the last query.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-11-14 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212423#comment-14212423
 ] 

Pankit Thapar commented on HIVE-8137:
-

@prasanthj in the Hive 14 release, this JIRA is listed under improvements, I 
think, and I do not see the patch there.

Also, I'll think about the suggestion. Thanks for your feedback.

 Empty ORC file handling
 ---

 Key: HIVE-8137
 URL: https://issues.apache.org/jira/browse/HIVE-8137
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8137.2.patch, HIVE-8137.patch


 Hive 13 does not handle reading of a zero-size ORC file properly. An ORC file 
 is supposed to have a postscript, which the ReaderImpl class tries to read in 
 order to initialize the footer. But if the file is empty, i.e. of zero size, 
 it runs into an IndexOutOfBoundsException because ReaderImpl tries to read it 
 in its constructor.
 Code snippet:
 //get length of PostScript
 int psLen = buffer.get(readSize - 1) & 0xff;
 In the above code, readSize for an empty file is zero.
 I see that the ensureOrcFooter() method performs some sanity checks on the 
 footer, so either we can move the above code snippet into ensureOrcFooter() 
 and throw a "malformed ORC file" exception, or we can create a dummy Reader 
 that does not initialize the footer and simply has hasNext() return false on 
 the first call.
 Basically, I would like to know what the correct way to handle an empty ORC 
 file in a mapred job might be.
 Should we ignore it and not throw an exception, or should we throw an 
 exception saying the ORC file is malformed?
 Please let me know your thoughts on this.
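
For illustration, a minimal sketch of the first option above (failing fast on a 
zero-size file before the postscript read); the variable names fs, path, buffer 
and readSize are assumed from the description, and this is not the attached patch:

// Hypothetical guard in the ORC reader before reading the PostScript length:
// an empty (zero-byte) file cannot contain a postscript, so report a malformed
// file instead of hitting an IndexOutOfBoundsException.
long fileSize = fs.getFileStatus(path).getLen();
if (fileSize == 0) {
  throw new IOException("Malformed ORC file " + path + ": file is empty");
}
// ... read the tail of the file into 'buffer' as before ...
int psLen = buffer.get(readSize - 1) & 0xff;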



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-11-12 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208669#comment-14208669
 ] 

Pankit Thapar commented on HIVE-8137:
-

[~hagleitn]: Can someone please review the changes?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-11-12 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14208672#comment-14208672
 ] 

Pankit Thapar commented on HIVE-8137:
-

[~hagleitn]: Can someone please review the changes?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8137) Empty ORC file handling

2014-11-12 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8137:

Fix Version/s: 0.14.0




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8400) Fix building and packaging hwi war file

2014-11-12 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8400:

Fix Version/s: 0.14.0




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8137) Empty ORC file handling

2014-10-13 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8137:

Attachment: HIVE-8137.2.patch

Get the input FileSystem for each file on input path instead of getting it only 
from the first path on the list of input paths.
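
For illustration, the kind of change described here, as a rough sketch (the 
surrounding variable names inputPaths and job are assumed, not the patch itself):

// Hypothetical sketch: resolve the FileSystem per input path rather than
// reusing the FileSystem of the first path for every file.
for (Path p : inputPaths) {
  FileSystem fs = p.getFileSystem(job);   // per-path FileSystem
  // ... list and process the files under this path with 'fs' ...
}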





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8400) hwi does not have war file

2014-10-13 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170191#comment-14170191
 ] 

Pankit Thapar commented on HIVE-8400:
-

[~vikram.dixit] Could you please take a look at this?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8400) hwi does not have war file

2014-10-13 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170211#comment-14170211
 ] 

Pankit Thapar commented on HIVE-8400:
-

I think this should go into 0.13.2 since this is something that is broken. 
Correct me if I am wrong.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8400) hwi does not have war file

2014-10-11 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168285#comment-14168285
 ] 

Pankit Thapar commented on HIVE-8400:
-

[~gopalv] , can you take a look at this?
Thanks!




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8400) hwi does not have war file

2014-10-08 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-8400:
---

 Summary: hwi does not have war file
 Key: HIVE-8400
 URL: https://issues.apache.org/jira/browse/HIVE-8400
 Project: Hive
  Issue Type: Bug
  Components: Web UI
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0


Hive 0.11 used to have an hwi-0.11.war file, but it seems that Hive 13 is not 
configured to build a war file; instead it builds a jar file for hwi.

A fix for this would be to change jar to war in hwi/pom.xml.
Diff is below:
diff --git a/hwi/pom.xml b/hwi/pom.xml
index 35c124b..88d83fb 100644
--- a/hwi/pom.xml
+++ b/hwi/pom.xml
@@ -24,7 +24,7 @@
   </parent>
 
   <artifactId>hive-hwi</artifactId>
-  <packaging>jar</packaging>
+  <packaging>war</packaging>
   <name>Hive HWI</name>

Please let me know whether jar was intended or this is actually a bug, so that 
I can submit a patch for the fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-10-08 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164025#comment-14164025
 ] 

Pankit Thapar commented on HIVE-8137:
-

Can you please tell me how to go about this?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-10-08 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164027#comment-14164027
 ] 

Pankit Thapar commented on HIVE-8137:
-

Also, I have added two unit test cases to cover my changes.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-10-06 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160456#comment-14160456
 ] 

Pankit Thapar commented on HIVE-8137:
-

[~gopalv], could you please comment on the failures? I don't think the above 
failures are due to my patch.
Also, could you please review the patch as well?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8137) Empty ORC file handling

2014-10-04 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8137:

Attachment: HIVE-8137.patch

Current Logic
=============
CombineHiveInputFormat.getSplits() makes a call to CombineFileInputFormatShim, 
which is a child class of CombineFileInputFormat (in Hadoop).
CombineFileInputFormatShim calls CombineFileInputFormat.getSplits(), which 
creates splits without checking the file size. So, as a result, we get 
CombineFileSplits which include empty files.

Issue with the current logic
============================
The presence of empty files is not correct for ORC, since the format requires 
certain things like a postscript to be present in the file.
This ends up causing an IndexOutOfBounds exception in the ORC reader, since it 
tries to access the postscript, which is not present in an empty file.

Fix
===
1. Override listStatus() of FileInputFormat in CombineFileInputFormatShim, so 
that when CombineFileInputFormat.getSplits() calls listStatus(), it ends up 
calling CombineFileInputFormatShim.listStatus(), which has the logic for 
skipping empty files when creating splits.

2. Also, avoid creating empty file splits in OrcInputFormat.FileGenerator.

Testing
=======
Added two unit tests to cover the two fixes.
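
For illustration, a rough sketch of fix 1; the class here is a stand-in for the 
shim (the name and exact inheritance are assumptions, not the actual patch):

// Hypothetical sketch: filter zero-length files out of the input listing so
// CombineFileInputFormat never builds splits over empty files.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public abstract class CombineFileInputFormatShimSketch<K, V> extends FileInputFormat<K, V> {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    FileStatus[] all = super.listStatus(job);
    List<FileStatus> nonEmpty = new ArrayList<FileStatus>();
    for (FileStatus file : all) {
      // Skip zero-length files: an empty ORC file has no postscript/footer.
      if (file.getLen() > 0) {
        nonEmpty.add(file);
      }
    }
    return nonEmpty.toArray(new FileStatus[nonEmpty.size()]);
  }
}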





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8137) Empty ORC file handling

2014-10-04 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8137:

Status: Patch Available  (was: Open)






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-09-17 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137591#comment-14137591
 ] 

Pankit Thapar commented on HIVE-8137:
-

I think Tez works in this case because in the Tez-related code flow, Hive 
creates files for empty tables.
I don't know if that would be the right approach for OrcInputFormat.
Also, one way to avoid creating the split would be to list the file statuses in 
CombineHiveInputFormat.getSplits() and filter out zero-length files, then pass 
that list on to Hadoop. But going this way, we add an O(n) overhead of getting 
the file statuses.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-09-16 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135778#comment-14135778
 ] 

Pankit Thapar commented on HIVE-8137:
-

The issue is that Hadoop might still create a split when it is a 
CombineInputFormat; Hadoop specifically creates empty splits.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8137) Empty ORC file handling

2014-09-16 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135821#comment-14135821
 ] 

Pankit Thapar commented on HIVE-8137:
-

I ran an insert overwrite query from an empty table into an ORC table. That 
triggered Hadoop's CombineFileInputFormat, which does not check whether the 
split is empty or not.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-16 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135848#comment-14135848
 ] 

Pankit Thapar commented on HIVE-8038:
-

Is .3.patch committed to trunk?


 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Assignee: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.2.patch, HIVE-8038.3.patch, HIVE-8038.patch


 What is the Current Logic
 =========================
 1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns 
 an array of BlockLocation.
 2. In SplitGenerator.createSplit(), check whether the split spans one block or 
 multiple blocks.
 3. If the split spans just one block, then using the array index (index = 
 offset/blockSize), get the corresponding host having the BlockLocation.
 4. If the split spans multiple blocks, then get all hosts that have at least 
 80% of the maximum of the total data in the split hosted by any host.
 5. Add the split to a list of splits.
 Issue with Current Logic
 ========================
 Dependency on the FileSystem API's logic for block location calculations. It 
 returns an array, and we need to rely on the FileSystem to make all blocks the 
 same size if we want to directly index a block in the array.
 What is the Fix
 ===============
 1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns 
 an array of BlockLocation.
 1b. Convert the array into a tree map <offset, BlockLocation> and return it 
 through getLocationsWithOffSet().
 2. In SplitGenerator.createSplit(), check whether the split spans one block or 
 multiple blocks.
 3. If the split spans just one block, then using TreeMap.floorEntry(key), get 
 the highest entry whose offset is not greater than the split's offset and get 
 the corresponding host.
 4a. If the split spans multiple blocks, get a submap, which contains all 
 entries whose BlockLocations fall between the offset and offset + length.
 4b. Get all hosts that have at least 80% of the maximum of the total data in 
 the split hosted by any host.
 5. Add the split to a list of splits.
 What are the major changes in logic
 ===================================
 1. Store BlockLocations in a map instead of an array.
 2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
 3. The one-block case is checked by if (offset + length <= start.getOffset() + 
 start.getLength()) instead of if ((offset % blockSize) + length <= blockSize).
 What is the effect on complexity (Big O)
 ========================================
 1. We add an O(n) loop to build a TreeMap from an array, but it is a one-time 
 cost and is not repeated for each split.
 2. In the one-block case, we can get the block in O(log n) worst case, which 
 was O(1) before.
 3. Getting the submap is O(log n).
 4. In the multiple-block case, building the list of hosts is O(m), which was 
 O(n), where m < n, as previously we were iterating over all the block 
 locations but now we iterate only over the blocks that belong to the range of 
 offsets that we need.
 What are the benefits of the change
 ===================================
 1. With this fix, we do not depend on the BlockLocations returned by the 
 FileSystem to figure out the block corresponding to the offset and blockSize.
 2. Also, it is no longer necessary that block lengths are the same for all 
 blocks on all FileSystems.
 3. Previously we were using blockSize for the one-block case and block.length 
 for the multiple-block case, which is not the case now. We figure out the 
 block depending upon the actual length and offset of the block.
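
For illustration, a minimal sketch of the offset lookup described in the fix, 
using a TreeMap keyed by block offset; the class and method names here are 
assumptions for the example, not the shim API itself:

// Hypothetical sketch: index BlockLocations by offset and use floorEntry()/
// subMap() instead of array-index arithmetic based on a fixed block size.
import java.util.NavigableMap;
import java.util.TreeMap;
import org.apache.hadoop.fs.BlockLocation;

final class BlockLocationIndexSketch {
  private final TreeMap<Long, BlockLocation> byOffset = new TreeMap<Long, BlockLocation>();

  BlockLocationIndexSketch(BlockLocation[] locations) {
    for (BlockLocation loc : locations) {
      byOffset.put(loc.getOffset(), loc);   // one-time O(n log n) build
    }
  }

  // Single-block case: the block covering 'offset' is the greatest entry whose
  // key is <= offset, found in O(log n). Assumes offset >= the first block's offset.
  BlockLocation blockFor(long offset) {
    return byOffset.floorEntry(offset).getValue();
  }

  // Multi-block case: all blocks overlapping [offset, offset + length).
  NavigableMap<Long, BlockLocation> blocksFor(long offset, long length) {
    Long start = byOffset.floorKey(offset);
    return byOffset.subMap(start, true, offset + length, false);
  }
}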



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-6554) CombineHiveInputFormat should use the underlying InputSplits

2014-09-16 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135863#comment-14135863
 ] 

Pankit Thapar commented on HIVE-6554:
-

Is there any update on this?


 CombineHiveInputFormat should use the underlying InputSplits
 

 Key: HIVE-6554
 URL: https://issues.apache.org/jira/browse/HIVE-6554
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley

 Currently CombineHiveInputFormat generates FileSplits without using the 
 underlying InputFormat. This leads to a problem when an InputFormat needs an 
 InputSplit that isn't exactly a FileSplit, because CombineHiveInputSplit 
 always generates FileSplits and then calls the underlying InputFormat's 
 getRecordReader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-15 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8038:

Attachment: HIVE-8038.2.patch

Attached the fixed patch as per CR: https://reviews.apache.org/r/25521/




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-15 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134413#comment-14134413
 ] 

Pankit Thapar commented on HIVE-8038:
-

[~gopalv], thanks for taking a look. I have changed the exception to 
IOException and uploaded the new patch here.


 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.2.patch, HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-12 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131648#comment-14131648
 ] 

Pankit Thapar commented on HIVE-8038:
-

Hi,

Can you please take a look at this code review: https://reviews.apache.org/r/25521/

Thanks,
Pankit

 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-12 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132263#comment-14132263
 ] 

Pankit Thapar commented on HIVE-8038:
-

Hi Gopal,

Thanks for taking a look.
I have uploaded an updated diff on the code review with the recommended 
changes, and have also replied to the comments there.
Please let me know your feedback.


 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-8038:
---

 Summary: Decouple ORC files split calculation logic from 
Filesystem's get file location implementation
 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0


What is the Current Logic
==
1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an 
array of BlockLocation.
2. In SplitGenerator.createSplit(), check whether the split spans one block or 
multiple blocks.
3. If the split spans just one block, then using the array index (index = 
offset / blockSize), get the corresponding host holding the blockLocation.
4. If the split spans multiple blocks, then get all hosts that have at least 80% 
of the maximum amount of split data hosted by any single host.
5. Add the split to a list of splits.

Issue with Current Logic
=
Dependency on the FileSystem API's logic for block location calculations. It 
returns an array, and we have to rely on the FileSystem making all blocks the 
same size if we want to index a block in the array directly.

What is the Fix
=
1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an 
array of BlockLocation.
1b. Convert the array into a tree map of <offset, BlockLocation> and return it 
through getLocationsWithOffSet().
2. In SplitGenerator.createSplit(), check whether the split spans one block or 
multiple blocks.
3. If the split spans just one block, then using TreeMap.floorEntry(key), get the 
entry with the greatest offset less than or equal to the split offset and get the 
corresponding host.
4a. If the split spans multiple blocks, get a submap that contains all entries 
whose blockLocations fall between offset and offset + length.
4b. Get all hosts that have at least 80% of the maximum amount of split data 
hosted by any single host (see the sketch after this description).
5. Add the split to a list of splits.

What are the major changes in logic
==
1. Store BlockLocations in a Map instead of an array.
2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
3. The one-block case is checked by if (offset + length <= start.getOffset() + 
start.getLength()) instead of if ((offset % blockSize) + length <= blockSize).

What is the effect on Complexity (Big O)
=

1. We add an O(n) loop to build a TreeMap from the array, but it is a one-time 
cost and is not incurred for each split.
2. In the one-block case, we get the block in O(log n) worst case, which was O(1) 
before.
3. Getting the submap is O(log n).
4. In the multiple-block case, building the list of hosts is O(m), which was 
O(n), with m < n, since previously we iterated over all the block locations but 
now we iterate only over the blocks that fall in the range of offsets that we 
need.

What are the benefits of the change
==
1. With this fix, we do not depend on the blockLocations returned by the 
FileSystem to figure out the block corresponding to the offset and blockSize.
2. Also, it is no longer assumed that the block length is the same for all blocks 
on all FileSystems.
3. Previously we used blockSize for the one-block case and block.length for the 
multiple-block case, which is no longer true. We figure out the block depending 
on the actual length and offset of the block.
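
To make step 4b concrete, here is a hedged sketch of the 80% host selection over
the blocks in a split; the method name and the exact byte accounting are
illustrative, not the actual OrcInputFormat code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import org.apache.hadoop.fs.BlockLocation;

// Illustrative sketch of step 4b: keep every host that stores at least 80% of
// the maximum amount of split data stored on any single host.
class HostSelectionSketch {
  static String[] selectHosts(NavigableMap<Long, BlockLocation> blocksInSplit,
                              long splitOffset, long splitLength) throws IOException {
    Map<String, Long> bytesPerHost = new HashMap<>();
    long splitEnd = splitOffset + splitLength;
    for (BlockLocation block : blocksInSplit.values()) {
      // Count only the part of the block that actually overlaps the split.
      long overlapStart = Math.max(splitOffset, block.getOffset());
      long overlapEnd = Math.min(splitEnd, block.getOffset() + block.getLength());
      long overlap = Math.max(0, overlapEnd - overlapStart);
      for (String host : block.getHosts()) {
        bytesPerHost.merge(host, overlap, Long::sum);
      }
    }
    long max = bytesPerHost.values().stream().mapToLong(Long::longValue).max().orElse(0);
    List<String> hosts = new ArrayList<>();
    for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
      if (e.getValue() >= 0.8 * max) {
        hosts.add(e.getKey());
      }
    }
    return hosts.toArray(new String[0]);
  }
}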



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8038:

Attachment: HIVE-8038.patch

 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-8038:

Status: Patch Available  (was: Open)

 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128871#comment-14128871
 ] 

Pankit Thapar commented on HIVE-8038:
-

Hi,

Thanks for the feedback.
1. The use case where the split may span more than one block is when 
Math.min(MAX_BLOCK_SIZE, 2 * stripeSize) returns MAX_BLOCK_SIZE as the block 
size for the file.
Example: with a stripe size of 512MB and a block size of 400MB, a split would 
span more than one block (worked through in the snippet below).

2. I see that HDFS wants to support variable-length blocks, but what I meant 
was to remove the usage of the blockSize variable altogether, since a single 
fixed block size is not guaranteed for all FileSystems. We want to generalize 
the logic for FileSystems other than HDFS.
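
A small numeric sketch of the example in point 1; the constant names and the
stripe-aligned split are assumptions made for illustration only:

class SplitSpanExample {
  public static void main(String[] args) {
    long MB = 1024L * 1024L;
    long maxBlockSize = 400 * MB;                             // assumed MAX_BLOCK_SIZE
    long stripeSize = 512 * MB;                               // ORC stripe size
    long blockSize = Math.min(maxBlockSize, 2 * stripeSize);  // = 400 MB
    long splitOffset = 0, splitLength = stripeSize;           // a stripe-aligned split
    // The split ends past the first block boundary, so it spans two blocks:
    // it covers [0, 512 MB), overlapping [0, 400 MB) and [400 MB, 800 MB).
    boolean spansMultipleBlocks = splitOffset + splitLength > blockSize;
    System.out.println(blockSize / MB + " MB blocks, spans multiple: " + spansMultipleBlocks);
  }
}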

 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129257#comment-14129257
 ] 

Pankit Thapar commented on HIVE-8038:
-

We have a custom FileSystem implementation over S3. Our block allocation logic 
is a little different from HDFS.

So, I will go ahead, look into the failed test, and try to fix it.
Do you have any comments on the code change?


 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation

2014-09-10 Thread Pankit Thapar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129266#comment-14129266
 ] 

Pankit Thapar commented on HIVE-8038:
-

org.apache.hive.hcatalog.pig.TestOrcHCatLoader.testReadDataPrimitiveTypes fails 
even without the patch I submitted.
Can someone please confirm that?



 Decouple ORC files split calculation logic from Filesystem's get file 
 location implementation
 -

 Key: HIVE-8038
 URL: https://issues.apache.org/jira/browse/HIVE-8038
 Project: Hive
  Issue Type: Improvement
  Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
 Fix For: 0.14.0

 Attachments: HIVE-8038.patch





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HIVE-7251) Fix StorageDescriptor usage in unit tests

2014-06-18 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-7251:
---

 Summary: Fix StorageDescriptor usage in unit tests 
 Key: HIVE-7251
 URL: https://issues.apache.org/jira/browse/HIVE-7251
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Priority: Minor


Current Approach:
The StorageDescriptor class is used to describe parameters like InputFormat, 
OutputFormat, SerDeInfo, etc. for a Hive table.
Some of the class variables, like InputFormat, OutputFormat, and 
SerDeInfo.serializationLib, are required fields when creating a 
StorageDescriptor object.
For example, the createTable command in the metastore client creates the table 
with the default values of such variables defined in HiveConf or hive-default.xml.
But in unit tests, the table is created in a slightly different way, so these 
values need to be set explicitly.
Thus, when creating tables in tests, the required fields of the 
StorageDescriptor object need to be set.

Issue with current approach:
From some of the current usages of this class in unit tests, I noticed that when 
any test case tries to clean up the database and finds a table created by a 
previously executed test case, the cleanup process fetches the Table object and 
performs sanity checks, which include checking for required fields like 
InputFormat, OutputFormat, and SerDeInfo.serializationLib of the table. The 
sanity checks fail, which results in failure of the test case.

Fix:
In unit tests, the StorageDescriptor object should be created with the fields 
that are sanity checked when fetching the table (see the sketch below).

NOTE : This fix fixes 6 test cases in itests/hive-unit/


--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7251) Fix StorageDescriptor usage in unit tests

2014-06-18 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7251:


Attachment: HIVE-7251.patch

 Fix StorageDescriptor usage in unit tests 
 --

 Key: HIVE-7251
 URL: https://issues.apache.org/jira/browse/HIVE-7251
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7251.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7251) Fix StorageDescriptor usage in unit tests

2014-06-18 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7251:


Status: Patch Available  (was: Open)

 Fix StorageDescriptor usage in unit tests 
 --

 Key: HIVE-7251
 URL: https://issues.apache.org/jira/browse/HIVE-7251
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7251.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-16 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Status: Patch Available  (was: Open)

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Assignee: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, 
 HIVE-7201.03.patch, HIVE-7201.04.patch, HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to test that HiveConf 
 properties are set with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not Configuration.
 The current test tries to get configuration properties like HADOOPNUMREDUCERS 
 (mapred.job.reduces) from the Configuration class, but these mapreduce-related 
 properties are loaded by the JobConf class from mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
   // The second argument is null, since it is the JobConf class, not the
   // Configuration class, that initializes the mapred-default values.
 }
 The code that loads the mapreduce resources is in ConfigUtil, and JobConf 
 makes a call like this (in a static block):
 public class JobConf extends Configuration {
 
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources
                                 // (mapred-default.xml)
   }
   ...
 }
 Please note, the test case assertion works fine if the HiveConf() constructor 
 is called before this assertion, since HiveConf() triggers JobConf(), which 
 sets the default values of the properties pertaining to mapreduce.
 This is why there won't be any failures if testHiveSitePath() is run before 
 testConfProperties(), as that would load the mapreduce properties into the 
 config properties.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce (a short sketch follows below).
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object instance.


--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-16 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Attachment: HIVE-7201.04.patch

Updated patch

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Assignee: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, 
 HIVE-7201.03.patch, HIVE-7201.04.patch, HIVE-7201.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-16 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Affects Version/s: (was: 0.13.0)
   0.13.1

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Assignee: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, 
 HIVE-7201.03.patch, HIVE-7201.04.patch, HIVE-7201.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7228) StreamPrinter should be joined to calling thread

2014-06-13 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-7228:
---

 Summary: StreamPrinter should be joined to calling thread 
 Key: HIVE-7228
 URL: https://issues.apache.org/jira/browse/HIVE-7228
 Project: Hive
  Issue Type: Bug
  Components: CLI
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor


ISSUE:
The StreamPrinter class is used to connect an input stream (attached to the 
output of a process) with the output stream of a Session 
(CliSessionState/SessionState class).
It acts as a pipe between the two and transfers data from the input stream to 
the output stream. THE TRANSFER OPERATION RUNS IN A SEPARATE THREAD.

From some of the current usages of this class, I noticed that the calling 
threads do not wait for the transfer operation to complete. That is, the 
calling thread does not join the StreamPrinter threads.
The calling thread moves forward assuming that the respective output stream 
already has the data it needs. But that is not always a safe assumption, since 
the StreamPrinter thread may not have finished by the time the calling thread 
expects it to.

FIX:
To ensure that the calling thread waits for the StreamPrinter threads to 
complete, the StreamPrinter threads are joined to the calling thread (see the 
sketch below).

Please note, without the fix, TestCliDriverMethods#testRun failed sometimes 
(roughly 1 in 30 runs). This test does not fail with the fix.


--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-13 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Status: Patch Available  (was: Open)

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, 
 HIVE-7201.03.patch, HIVE-7201.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7228) StreamPrinter should be joined to calling thread

2014-06-13 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7228:


Attachment: HIVE-7228.patch

Added join() to usages of StreamPrinter

 StreamPrinter should be joined to calling thread 
 -

 Key: HIVE-7228
 URL: https://issues.apache.org/jira/browse/HIVE-7228
 Project: Hive
  Issue Type: Bug
  Components: CLI
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7228.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7228) StreamPrinter should be joined to calling thread

2014-06-13 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7228:


Status: Patch Available  (was: Open)

 StreamPrinter should be joined to calling thread 
 -

 Key: HIVE-7228
 URL: https://issues.apache.org/jira/browse/HIVE-7228
 Project: Hive
  Issue Type: Bug
  Components: CLI
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7228.patch





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-12 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Attachment: HIVE-7201-2.patch

This is the correct patch. The previous one was rebased to trunk; this one is 
rebased to the latest branch-0.13.

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
 properties are applied with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not by 
 Configuration.
 The current test tries to read configuration properties such as 
 HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
 these mapreduce-related properties are loaded by the JobConf class from 
 mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because:
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   // The second argument evaluates to null here, since it is the JobConf
   // class, not the Configuration class, that initializes the
   // mapred-default values.
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
 }
 The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
 invokes it in a static block:
 public class JobConf extends Configuration {
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
   }
   ...
 }
 Please note that the assertion passes if the HiveConf() constructor is called 
 before it, since HiveConf() triggers JobConf(), which sets the default values 
 of the mapreduce properties.
 This is why there are no failures when testHiveSitePath() runs before 
 testConfProperties(): that run loads the mapreduce properties into the 
 configuration.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce.
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object.
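
 A minimal sketch of CHANGE 1 follows, assuming JUnit 4 and the Hadoop mapred 
 API; the property name is written out instead of going through ConfVars so 
 the example stays self-contained:

 import org.apache.hadoop.mapred.JobConf;
 import org.junit.Assert;
 import org.junit.Test;

 public class TestHiveConfSketch {

   // CHANGE 1: look the value up through JobConf, whose static initializer runs
   // ConfigUtil.loadResources() and thus loads mapred-default.xml, instead of
   // through a bare Configuration that never sees those defaults.
   private void checkHadoopConf(String name, String expectedHadoopVal) {
     Assert.assertEquals(expectedHadoopVal, new JobConf().get(name));
   }

   @Test
   public void testConfProperties() {
     // mapred.job.reduces defaults to "1" in mapred-default.xml
     checkHadoopConf("mapred.job.reduces", "1");
   }
 }

 CHANGE 2 is a one-line cleanup in the same spirit: getHiveSiteLocation() is 
 invoked on the HiveConf class rather than through an instance.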



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-12 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Attachment: HIVE-7201.03.patch

Renamed the patch to kick off the autobuild.
Rebased the patch against trunk instead of branch-0.13.

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, 
 HIVE-7201.03.patch, HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
 properties are applied with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not by 
 Configuration.
 The current test tries to read configuration properties such as 
 HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
 these mapreduce-related properties are loaded by the JobConf class from 
 mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because:
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   // The second argument evaluates to null here, since it is the JobConf
   // class, not the Configuration class, that initializes the
   // mapred-default values.
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
 }
 The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
 invokes it in a static block:
 public class JobConf extends Configuration {
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
   }
   ...
 }
 Please note that the assertion passes if the HiveConf() constructor is called 
 before it, since HiveConf() triggers JobConf(), which sets the default values 
 of the mapreduce properties.
 This is why there are no failures when testHiveSitePath() runs before 
 testConfProperties(): that run loads the mapreduce properties into the 
 configuration.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce.
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-10 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Attachment: HIVE-7201-1.patch

The patch is rebased against the latest branch-0.13.

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
 properties are applied with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not by 
 Configuration.
 The current test tries to read configuration properties such as 
 HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
 these mapreduce-related properties are loaded by the JobConf class from 
 mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because:
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   // The second argument evaluates to null here, since it is the JobConf
   // class, not the Configuration class, that initializes the
   // mapred-default values.
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
 }
 The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
 invokes it in a static block:
 public class JobConf extends Configuration {
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
   }
   ...
 }
 Please note that the assertion passes if the HiveConf() constructor is called 
 before it, since HiveConf() triggers JobConf(), which sets the default values 
 of the mapreduce properties.
 This is why there are no failures when testHiveSitePath() runs before 
 testConfProperties(): that run loads the mapreduce properties into the 
 configuration.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce.
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-10 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Status: Patch Available  (was: Open)

The patch is rebased against the latest branch-0.13.

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201-1.patch, HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
 properties are applied with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not by 
 Configuration.
 The current test tries to read configuration properties such as 
 HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
 these mapreduce-related properties are loaded by the JobConf class from 
 mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because:
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   // The second argument evaluates to null here, since it is the JobConf
   // class, not the Configuration class, that initializes the
   // mapred-default values.
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
 }
 The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
 invokes it in a static block:
 public class JobConf extends Configuration {
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
   }
   ...
 }
 Please note that the assertion passes if the HiveConf() constructor is called 
 before it, since HiveConf() triggers JobConf(), which sets the default values 
 of the mapreduce properties.
 This is why there are no failures when testHiveSitePath() runs before 
 testConfProperties(): that run loads the mapreduce properties into the 
 configuration.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce.
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-09 Thread Pankit Thapar (JIRA)
Pankit Thapar created HIVE-7201:
---

 Summary: Fix TestHiveConf#testConfProperties test case
 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor


CHANGE 1:

TEST CASE:
The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
properties are applied with the expected priority.

Each HiveConf object is initialized as follows:
1) Hadoop configuration properties are applied.
2) ConfVar properties with non-null values are overlaid.
3) hive-site.xml properties are overlaid.

ISSUE:
The mapreduce-related configurations are loaded by JobConf, not by 
Configuration.
The current test tries to read configuration properties such as 
HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
these mapreduce-related properties are loaded by the JobConf class from 
mapred-default.xml.

DETAILS:
LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
because:
private void checkHadoopConf(String name, String expectedHadoopVal) {
  // The second argument evaluates to null here, since it is the JobConf
  // class, not the Configuration class, that initializes the
  // mapred-default values.
  Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
}

The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
invokes it in a static block:
public class JobConf extends Configuration {

  private static final Log LOG = LogFactory.getLog(JobConf.class);

  static {
    ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
  }
  ...
}

Please note that the assertion passes if the HiveConf() constructor is called 
before it, since HiveConf() triggers JobConf(), which sets the default values 
of the mapreduce properties.
This is why there are no failures when testHiveSitePath() runs before 
testConfProperties(): that run loads the mapreduce properties into the 
configuration.

FIX:
Instead of using a Configuration object, we can use the JobConf object to get 
the default values used by hadoop/mapreduce.

CHANGE 2:
In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
should be called statically instead of through an object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-09 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Status: Patch Available  (was: Open)

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
 properties are applied with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not by 
 Configuration.
 The current test tries to read configuration properties such as 
 HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
 these mapreduce-related properties are loaded by the JobConf class from 
 mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because:
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   // The second argument evaluates to null here, since it is the JobConf
   // class, not the Configuration class, that initializes the
   // mapred-default values.
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
 }
 The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
 invokes it in a static block:
 public class JobConf extends Configuration {
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
   }
   ...
 }
 Please note that the assertion passes if the HiveConf() constructor is called 
 before it, since HiveConf() triggers JobConf(), which sets the default values 
 of the mapreduce properties.
 This is why there are no failures when testHiveSitePath() runs before 
 testConfProperties(): that run loads the mapreduce properties into the 
 configuration.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce.
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case

2014-06-09 Thread Pankit Thapar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pankit Thapar updated HIVE-7201:


Attachment: HIVE-7201.patch

 Fix TestHiveConf#testConfProperties test case
 -

 Key: HIVE-7201
 URL: https://issues.apache.org/jira/browse/HIVE-7201
 Project: Hive
  Issue Type: Bug
  Components: Tests
Affects Versions: 0.13.0
Reporter: Pankit Thapar
Priority: Minor
 Attachments: HIVE-7201.patch


 CHANGE 1:
 TEST CASE:
 The intention of TestHiveConf#testConfProperties() is to verify that HiveConf 
 properties are applied with the expected priority.
 Each HiveConf object is initialized as follows:
 1) Hadoop configuration properties are applied.
 2) ConfVar properties with non-null values are overlaid.
 3) hive-site.xml properties are overlaid.
 ISSUE:
 The mapreduce-related configurations are loaded by JobConf, not by 
 Configuration.
 The current test tries to read configuration properties such as 
 HADOOPNUMREDUCERS (mapred.job.reduces) through the Configuration class, but 
 these mapreduce-related properties are loaded by the JobConf class from 
 mapred-default.xml.
 DETAILS:
 LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails
 because:
 private void checkHadoopConf(String name, String expectedHadoopVal) {
   // The second argument evaluates to null here, since it is the JobConf
   // class, not the Configuration class, that initializes the
   // mapred-default values.
   Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
 }
 The code that loads the mapreduce resources lives in ConfigUtil, and JobConf 
 invokes it in a static block:
 public class JobConf extends Configuration {
   private static final Log LOG = LogFactory.getLog(JobConf.class);
   static {
     ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
   }
   ...
 }
 Please note that the assertion passes if the HiveConf() constructor is called 
 before it, since HiveConf() triggers JobConf(), which sets the default values 
 of the mapreduce properties.
 This is why there are no failures when testHiveSitePath() runs before 
 testConfProperties(): that run loads the mapreduce properties into the 
 configuration.
 FIX:
 Instead of using a Configuration object, we can use the JobConf object to get 
 the default values used by hadoop/mapreduce.
 CHANGE 2:
 In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() 
 should be called statically instead of through an object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)