[jira] [Created] (HIVE-10022) DFS in authorization might take too long
Pankit Thapar created HIVE-10022:
-------------------------------------

Summary: DFS in authorization might take too long
Key: HIVE-10022
URL: https://issues.apache.org/jira/browse/HIVE-10022
Project: Hive
Issue Type: Bug
Components: Authorization
Affects Versions: 0.14.0
Reporter: Pankit Thapar

I am testing a query like this:

set hive.test.authz.sstd.hs2.mode=true;
set hive.security.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactoryForTest;
set hive.security.authenticator.manager=org.apache.hadoop.hive.ql.security.SessionStateConfigUserAuthenticator;
set hive.security.authorization.enabled=true;
set user.name=user1;
create table auth_noupd(i int) clustered by (i) into 2 buckets stored as orc location '${OUTPUT}' TBLPROPERTIES ('transactional'='true');

Since authorization is enabled, the query ends up calling doAuthorizationV2(), which ultimately calls SQLAuthorizationUtils.getPrivilegesFromFS(), which in turn calls the recursive method FileUtils.isActionPermittedForFileHierarchy() with the object we are trying to authorize, or with its ancestor if the object does not exist. The logic in FileUtils.isActionPermittedForFileHierarchy() is a DFS.

Now assume we are trying to authorize a path a/b/c/d. If a/b/c/d does not exist, we call FileUtils.isActionPermittedForFileHierarchy() with, say, a/b (assuming a/b/c does not exist either). If the subtree under a/b contains millions of files, then FileUtils.isActionPermittedForFileHierarchy() is going to check the file permission on each of those objects. I do not completely understand why we have to check file permissions on all the objects in branches of the tree that we are not trying to read from or write to. We could instead check the file permission on the ancestor that exists and, if it matches what we expect, return true.

Please confirm whether this is a bug so that I can submit a patch; otherwise, let me know what I am missing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
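For illustration, here is a rough, standalone sketch of the two approaches discussed above: the recursive hierarchy walk versus checking only the deepest existing ancestor. It uses the Hadoop FileSystem API directly; the class and method names are made up for this sketch (it is not Hive's actual FileUtils/SQLAuthorizationUtils code), and the owner/group handling is deliberately simplified to the "other" permission bits.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;

public class AncestorPermissionSketch {

  // Recursive walk in the spirit of isActionPermittedForFileHierarchy():
  // every file and directory under 'dir' is visited, so a subtree with
  // millions of files means millions of permission checks.
  static boolean permittedForHierarchy(FileSystem fs, Path dir, FsAction action)
      throws IOException {
    for (FileStatus stat : fs.listStatus(dir)) {
      if (!hasAccess(stat, action)) {
        return false;
      }
      if (stat.isDirectory() && !permittedForHierarchy(fs, stat.getPath(), action)) {
        return false;
      }
    }
    return true;
  }

  // Alternative suggested above: walk up to the deepest ancestor that exists
  // and check the requested action on that single directory only.
  static boolean permittedOnExistingAncestor(FileSystem fs, Path path, FsAction action)
      throws IOException {
    Path current = path;
    while (current != null && !fs.exists(current)) {
      current = current.getParent();
    }
    return current != null && hasAccess(fs.getFileStatus(current), action);
  }

  // Simplified: a real check would pick the user/group/other bits based on the
  // current user and group membership.
  private static boolean hasAccess(FileStatus stat, FsAction action) {
    return stat.getPermission().getOtherAction().implies(action);
  }
}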
[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
[ https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243037#comment-14243037 ]

Pankit Thapar commented on HIVE-8955:
--------------------------------------

[~ashutoshc], do you have any insight on this?

alter partition should check for hive.stats.autogather in hiveConf
-------------------------------------------------------------------

Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
Pankit Thapar created HIVE-8955:
-----------------------------------

Summary: alter partition should check for hive.stats.autogather in hiveConf
Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

When the alter partition code path is triggered, it should check the flag hive.stats.autogather: only update the stats if it is true, and skip them otherwise. This is done in the append_partition code flow. Is there any specific reason alter_partition does not respect this conf variable?

// code snippet: HiveMetastore.java
private Partition append_partition_common(RawStore ms, String dbName, String tableName,
    List<String> part_vals, EnvironmentContext envContext)
    throws InvalidObjectException, AlreadyExistsException, MetaException {
  ...
  if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVESTATSAUTOGATHER)
      && !MetaStoreUtils.isView(tbl)) {
    MetaStoreUtils.updatePartitionStatsFast(part, wh, madeDir);
  }
  ...
}

The above code snippet checks the variable, but the same check is absent in:

// code snippet: HiveAlterHandler.java
public Partition alterPartition(final RawStore msdb, Warehouse wh, final String dbname,
    final String name, final List<String> part_vals, final Partition new_part)
    throws InvalidOperationException, InvalidObjectException, AlreadyExistsException, MetaException {
  ...
}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
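To make the proposal concrete, here is an illustrative sketch of the same HIVESTATSAUTOGATHER guard that append_partition_common() applies, pulled into a small helper that the alter_partition path could call. The class and method names are made up, and the parameter names simply mirror the metastore code quoted above; this is a sketch of the idea, not the committed change.

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.MetaStoreUtils;
import org.apache.hadoop.hive.metastore.Warehouse;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.Table;

class AlterPartitionStatsSketch {
  // Only update partition stats when hive.stats.autogather is enabled and the
  // table is not a view -- the same condition the append_partition path uses.
  static void maybeUpdateStats(HiveConf hiveConf, Table tbl, Partition part,
      Warehouse wh, boolean madeDir) throws MetaException {
    if (HiveConf.getBoolVar(hiveConf, HiveConf.ConfVars.HIVESTATSAUTOGATHER)
        && !MetaStoreUtils.isView(tbl)) {
      MetaStoreUtils.updatePartitionStatsFast(part, wh, madeDir);
    }
  }
}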
[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
[ https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223568#comment-14223568 ]

Pankit Thapar commented on HIVE-8955:
--------------------------------------

[~szehon], can you please confirm whether this is a bug or intentional?

alter partition should check for hive.stats.autogather in hiveConf
-------------------------------------------------------------------

Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
[ https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223712#comment-14223712 ]

Pankit Thapar commented on HIVE-8955:
--------------------------------------

Thanks for the quick glance, [~szehon]. As far as I can tell, it is the same flag, hive.stats.autogather:

// Statistics
HIVESTATSAUTOGATHER("hive.stats.autogather", true,
    "A flag to gather statistics automatically during the INSERT OVERWRITE command."),

But this flag is not used in the code flow for alter_partition.

alter partition should check for hive.stats.autogather in hiveConf
-------------------------------------------------------------------

Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
[ https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223778#comment-14223778 ]

Pankit Thapar commented on HIVE-8955:
--------------------------------------

Yes, you are correct that the stats are updated during insert overwrite, but insert overwrite may itself call append_partition or alter_partition. The append path respects the flag, but the alter partition path does not.

alter partition should check for hive.stats.autogather in hiveConf
-------------------------------------------------------------------

Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
[ https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223782#comment-14223782 ]

Pankit Thapar commented on HIVE-8955:
--------------------------------------

[~ashutoshc], can you please confirm whether or not this is a bug?

alter partition should check for hive.stats.autogather in hiveConf
-------------------------------------------------------------------

Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8955) alter partition should check for hive.stats.autogather in hiveConf
[ https://issues.apache.org/jira/browse/HIVE-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223797#comment-14223797 ]

Pankit Thapar commented on HIVE-8955:
--------------------------------------

If I insert overwrite into an already existing partition, I see that it does the stats update even when hive.stats.autogather is set to false. For example:

[hadoop@ip-10-169-146-156 ~]$ hive --hiveconf hive.log.dir=. --hiveconf hive.stats.autogather=false
hive> create table test(x string, y string, z string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH './file.txt' OVERWRITE INTO TABLE test;
hive> create table test_part(a string) PARTITIONED BY (x string, y string) LOCATION 'my table location';
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE test_part PARTITION (x,y) select x,y,z from test;

I see the stats being updated for the last query.

alter partition should check for hive.stats.autogather in hiveConf
-------------------------------------------------------------------

Key: HIVE-8955
URL: https://issues.apache.org/jira/browse/HIVE-8955
Project: Hive
Issue Type: Improvement
Components: Metastore
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.15.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212423#comment-14212423 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

@prasanthj, in the Hive 14 release this JIRA is listed under improvements, I think, and I do not see the patch being there. Also, I'll think about the suggestion. Thanks for your feedback.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.2.patch, HIVE-8137.patch

Hive 13 does not handle reading of a zero-size ORC file properly. An ORC file is supposed to have a PostScript, which the ReaderImpl class tries to read and initialize the footer with. But if the file is empty or of zero size, it runs into an IndexOutOfBoundsException because ReaderImpl tries to read it in its constructor.

Code snippet:

// get length of PostScript
int psLen = buffer.get(readSize - 1) & 0xff;

In the above code, readSize for an empty file is zero. I see that the ensureOrcFooter() method performs some sanity checks on the footer, so either we can move the above code snippet into ensureOrcFooter() and throw a malformed ORC file exception, or we can create a dummy Reader that does not initialize the footer and basically has hasNext() set to false, so that it returns false on the first call.

Basically, I would like to know what the correct way to handle an empty ORC file in a mapred job might be. Should we ignore it and not throw an exception, or should we throw an exception that the ORC file is malformed? Please let me know your thoughts on this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
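As a sketch of the kind of guard being discussed, the reader could check the file length before touching the PostScript byte. This is illustrative only: it is not the actual ReaderImpl code, the class and helper names are made up, and the buffer sizing is simplified.

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class OrcPostScriptSketch {
  // Hypothetical helper: read the last byte of the file (the PostScript length)
  // only after confirming the file is non-empty, instead of indexing into a
  // zero-length buffer as described above.
  static int readPostScriptLength(FileSystem fs, Path file) throws IOException {
    FileStatus stat = fs.getFileStatus(file);
    if (stat.getLen() == 0) {
      // One option from the description: treat the file as malformed.
      throw new IOException("Malformed ORC file " + file + ": file is empty");
      // The other option is to hand back a dummy Reader whose hasNext()
      // always returns false, so the empty file is silently skipped.
    }
    int readSize = (int) Math.min(stat.getLen(), 16 * 1024);
    ByteBuffer buffer = ByteBuffer.allocate(readSize);
    try (FSDataInputStream in = fs.open(file)) {
      // Read the tail of the file, which contains the PostScript.
      in.readFully(stat.getLen() - readSize, buffer.array(), 0, readSize);
    }
    return buffer.get(readSize - 1) & 0xff; // PostScript length is the last byte
  }
}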
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208669#comment-14208669 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

[~hagleitn]: Can someone please review the changes?

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Attachments: HIVE-8137.2.patch, HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208672#comment-14208672 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

[~hagleitn]: Can someone please review the changes?

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Attachments: HIVE-8137.2.patch, HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8137:
------------------------------------
Fix Version/s: 0.14.0

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.2.patch, HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8400) Fix building and packaging hwi war file
[ https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8400:
------------------------------------
Fix Version/s: 0.14.0

Fix building and packaging hwi war file
---------------------------------------

Key: HIVE-8400
URL: https://issues.apache.org/jira/browse/HIVE-8400
Project: Hive
Issue Type: Bug
Components: Web UI
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8400.1.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8137:
------------------------------------
Attachment: HIVE-8137.2.patch

Get the input FileSystem for each file on the input path, instead of getting it only from the first path in the list of input paths.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.2.patch, HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8400) hwi does not have war file
[ https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170191#comment-14170191 ]

Pankit Thapar commented on HIVE-8400:
--------------------------------------

[~vikram.dixit], could you please take a look at this?

hwi does not have war file
--------------------------

Key: HIVE-8400
URL: https://issues.apache.org/jira/browse/HIVE-8400
Project: Hive
Issue Type: Bug
Components: Web UI
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8400) hwi does not have war file
[ https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170211#comment-14170211 ]

Pankit Thapar commented on HIVE-8400:
--------------------------------------

I think this should go into 0.13.2, since this is something that is broken. Correct me if I am wrong.

hwi does not have war file
--------------------------

Key: HIVE-8400
URL: https://issues.apache.org/jira/browse/HIVE-8400
Project: Hive
Issue Type: Bug
Components: Web UI
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8400) hwi does not have war file
[ https://issues.apache.org/jira/browse/HIVE-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168285#comment-14168285 ]

Pankit Thapar commented on HIVE-8400:
--------------------------------------

[~gopalv], can you take a look at this? Thanks!

hwi does not have war file
--------------------------

Key: HIVE-8400
URL: https://issues.apache.org/jira/browse/HIVE-8400
Project: Hive
Issue Type: Bug
Components: Web UI
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (HIVE-8400) hwi does not have war file
Pankit Thapar created HIVE-8400:
-----------------------------------

Summary: hwi does not have war file
Key: HIVE-8400
URL: https://issues.apache.org/jira/browse/HIVE-8400
Project: Hive
Issue Type: Bug
Components: Web UI
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

Hive 0.11 used to have an hwi-0.11.war file, but it seems that Hive 13 is not configured to build a war file; instead it builds a jar file for hwi. A fix for this would be to change jar to war in hwi/pom.xml. The diff is below:

diff --git a/hwi/pom.xml b/hwi/pom.xml
index 35c124b..88d83fb 100644
--- a/hwi/pom.xml
+++ b/hwi/pom.xml
@@ -24,7 +24,7 @@
   </parent>

   <artifactId>hive-hwi</artifactId>
-  <packaging>jar</packaging>
+  <packaging>war</packaging>
   <name>Hive HWI</name>

Please let me know if jar was intended or whether it is actually a bug, so that I can submit a patch for the fix.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164025#comment-14164025 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

Can you please tell me how to go about this?

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164027#comment-14164027 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

Also, I have added two unit test cases to cover my changes.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160456#comment-14160456 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

[~gopalv], I don't think the above failures are due to my patch; could you please comment on that? Also, could you please review the patch as well?

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8137:
------------------------------------
Attachment: HIVE-8137.patch

Current Logic
=============
CombineHiveInputFormat.getSplits() makes a call to CombineFileInputFormatShim, which is a child class of CombineFileInputFormat (in Hadoop). CombineFileInputFormatShim calls CombineFileInputFormat.getSplits(), which creates splits without checking the file size. As a result, we get CombineFileSplits which contain empty files.

Issue with the current logic
============================
The existence of empty files is not correct for ORC files, since the format requires certain things, like the PostScript, to be present in the file. This ends up causing an ArrayIndexOutOfBoundsException in the ORC reader, since it tries to access the PostScript, which is not present in the empty file.

Fix
===
1. Override listStatus() of FileInputFormat in CombineFileInputFormatShim, so that when CombineFileInputFormat.getSplits() calls listStatus(), it ends up calling CombineFileInputFormatShim.listStatus(), which has the logic for skipping empty files when creating splits.
2. Also, avoid creating empty file splits in OrcInputFormat.FileGenerator.

Testing
=======
Added two unit tests to cover the two fixes.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
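As a rough illustration of fix (1) above, the core step is filtering zero-length files out of the listing before splits are built. The real change sits inside CombineFileInputFormatShim.listStatus(); the standalone helper below, with a made-up class name, only shows the filtering itself.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;

class EmptyFileFilterSketch {
  // Return only the entries worth generating splits for: directories are kept,
  // and zero-length regular files are skipped, since they cannot contain a
  // valid ORC PostScript/footer.
  static List<FileStatus> dropEmptyFiles(List<FileStatus> input) {
    List<FileStatus> nonEmpty = new ArrayList<>();
    for (FileStatus stat : input) {
      if (stat.isDirectory() || stat.getLen() > 0) {
        nonEmpty.add(stat);
      }
    }
    return nonEmpty;
  }
}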
[jira] [Updated] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8137:
------------------------------------
Status: Patch Available  (was: Open)

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8137.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137591#comment-14137591 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

I think Tez works in this case because in the Tez-related code flow, Hive creates files for empty tables. I don't know if that would be the right approach for OrcInputFormat. Also, one way to avoid creating the split would be to list the file statuses in CombineHiveInputFormat.getSplits(), filter out zero-length files, and then pass that list on to Hadoop. But going this way, we add an O(n) overhead of getting the file statuses.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135778#comment-14135778 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

The issue is that Hadoop may create such a split in the case of a CombineInputFormat. Hadoop specifically creates empty splits.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8137) Empty ORC file handling
[ https://issues.apache.org/jira/browse/HIVE-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135821#comment-14135821 ]

Pankit Thapar commented on HIVE-8137:
--------------------------------------

I ran an insert overwrite query from an empty table into an ORC table. That triggered Hadoop's CombineFileInputFormat, which does not check whether the split is empty.

Empty ORC file handling
-----------------------

Key: HIVE-8137
URL: https://issues.apache.org/jira/browse/HIVE-8137
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135848#comment-14135848 ]

Pankit Thapar commented on HIVE-8038:
--------------------------------------

Is .3.patch committed to trunk?

Decouple ORC files split calculation logic from Filesystem's get file location implementation
----------------------------------------------------------------------------------------------

Key: HIVE-8038
URL: https://issues.apache.org/jira/browse/HIVE-8038
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Assignee: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8038.2.patch, HIVE-8038.3.patch, HIVE-8038.patch

What is the Current Logic
=========================
1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an array of BlockLocation.
2. In SplitGenerator.createSplit(), check whether the split spans only one block or multiple blocks.
3. If the split spans just one block, then using the array index (index = offset/blockSize), get the corresponding host having the BlockLocation.
4. If the split spans multiple blocks, then get all hosts that have at least 80% of the maximum of the total data in the split hosted by any host.
5. Add the split to a list of splits.

Issue with Current Logic
========================
Dependency on the FileSystem API's logic for block location calculations. It returns an array, and we need to rely on the FileSystem to make all blocks the same size if we want to directly access a block from the array.

What is the Fix
===============
1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns an array of BlockLocation.
1b. Convert the array into a tree map <offset, BlockLocation> and return it through getLocationsWithOffSet().
2. In SplitGenerator.createSplit(), check whether the split spans only one block or multiple blocks.
3. If the split spans just one block, then using TreeMap.floorEntry(key), get the highest entry smaller than the offset for the split and get the corresponding host.
4a. If the split spans multiple blocks, get a submap which contains all entries with BlockLocations from offset to offset + length.
4b. Get all hosts that have at least 80% of the maximum of the total data in the split hosted by any host.
5. Add the split to a list of splits.

What are the major changes in logic
===================================
1. Store BlockLocations in a map instead of an array.
2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
3. The one-block case is checked by if (offset + length <= start.getOffset() + start.getLength()) instead of if ((offset % blockSize) + length <= blockSize).

What is the effect on Complexity (Big O)
========================================
1. We add an O(n) loop to build a TreeMap from an array, but it is a one-time cost and is not incurred for each split.
2. In the one-block case, we can get the block in O(log n) worst case, which was O(1) before.
3. Getting the submap is O(log n).
4. In the multiple-block case, building the list of hosts is O(m), which was O(n), with m < n, as previously we were iterating over all the block locations, but now we only iterate over the blocks that belong to the range of offsets that we need.

What are the benefits of the change
===================================
1. With this fix, we do not depend on the BlockLocations returned by the FileSystem to figure out the block corresponding to the offset and blockSize.
2. Also, it is no longer necessary that block lengths are the same for all blocks for all FileSystems.
3. Previously we were using blockSize for the one-block case and block.length for the multiple-block case, which is not the case now. We figure out the block depending upon the actual length and offset of the block.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
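To make the offset-keyed lookup concrete, here is an illustrative sketch of the TreeMap approach described above. The class and helper names are made up, and this is not the attached patch; it only demonstrates the floorEntry()/subMap() lookups on a map keyed by block offset.

import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.fs.BlockLocation;

class BlockLocationIndexSketch {

  // Build a TreeMap keyed by block offset, so lookups no longer assume
  // uniform block sizes (an O(n) cost paid once per file, not per split).
  static TreeMap<Long, BlockLocation> indexByOffset(BlockLocation[] blocks) {
    TreeMap<Long, BlockLocation> byOffset = new TreeMap<>();
    for (BlockLocation block : blocks) {
      byOffset.put(block.getOffset(), block);
    }
    return byOffset;
  }

  // One-block case: floorEntry() finds the block containing 'offset' in
  // O(log n) instead of indexing the array with offset/blockSize.
  static BlockLocation blockContaining(TreeMap<Long, BlockLocation> byOffset, long offset) {
    Map.Entry<Long, BlockLocation> entry = byOffset.floorEntry(offset);
    return entry == null ? null : entry.getValue();
  }

  // Multiple-block case: subMap() yields only the blocks overlapping
  // [offset, offset + length), so host accounting iterates m <= n blocks.
  static Iterable<BlockLocation> blocksInRange(TreeMap<Long, BlockLocation> byOffset,
      long offset, long length) {
    Long first = byOffset.floorKey(offset);
    long from = (first == null) ? offset : first;
    return byOffset.subMap(from, true, offset + length, false).values();
  }
}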
[jira] [Commented] (HIVE-6554) CombineHiveInputFormat should use the underlying InputSplits
[ https://issues.apache.org/jira/browse/HIVE-6554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135863#comment-14135863 ]

Pankit Thapar commented on HIVE-6554:
--------------------------------------

Is there any update on this?

CombineHiveInputFormat should use the underlying InputSplits
-------------------------------------------------------------

Key: HIVE-6554
URL: https://issues.apache.org/jira/browse/HIVE-6554
Project: Hive
Issue Type: Bug
Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley

Currently CombineHiveInputFormat generates FileSplits without using the underlying InputFormat. This leads to a problem when an InputFormat needs an InputSplit that isn't exactly a FileSplit, because CombineHiveInputSplit always generates FileSplits and then calls the underlying InputFormat's getRecordReader.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pankit Thapar updated HIVE-8038:
------------------------------------
Attachment: HIVE-8038.2.patch

Attached the fixed patch as per the CR: https://reviews.apache.org/r/25521/

Decouple ORC files split calculation logic from Filesystem's get file location implementation
----------------------------------------------------------------------------------------------

Key: HIVE-8038
URL: https://issues.apache.org/jira/browse/HIVE-8038
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8038.2.patch, HIVE-8038.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134413#comment-14134413 ]

Pankit Thapar commented on HIVE-8038:
--------------------------------------

[~gopalv], thanks for taking a look. I have changed the exception to IOException and uploaded the new patch here.

Decouple ORC files split calculation logic from Filesystem's get file location implementation
----------------------------------------------------------------------------------------------

Key: HIVE-8038
URL: https://issues.apache.org/jira/browse/HIVE-8038
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
Attachments: HIVE-8038.2.patch, HIVE-8038.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131648#comment-14131648 ] Pankit Thapar commented on HIVE-8038: - Hi, Can you please take a look at this cr : https://reviews.apache.org/r/25521/ Thanks, Pankit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132263#comment-14132263 ] Pankit Thapar commented on HIVE-8038: - Hi Gopal, Thanks for taking a look. I have uploaded an updated diff on the cr with the recommended changes, and I have also replied to the feedback there. Please let me know your thoughts on the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
Pankit Thapar created HIVE-8038: --- Summary: Decouple ORC files split calculation logic from Filesystem's get file location implementation Key: HIVE-8038 URL: https://issues.apache.org/jira/browse/HIVE-8038 Project: Hive Issue Type: Improvement Components: File Formats Affects Versions: 0.13.1 Reporter: Pankit Thapar Fix For: 0.14.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-8038: Attachment: HIVE-8038.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-8038: Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128871#comment-14128871 ] Pankit Thapar commented on HIVE-8038: - Hi, Thanks for the feedback. 1. The case where a split may span more than one block is when Math.min(MAX_BLOCK_SIZE, 2 * stripeSize) returns MAX_BLOCK_SIZE as the size of the block for the file. Example: if the stripe size is 512MB and the block size is 400MB, a split would span more than one block. 2. I see that HDFS wants to support variable-length blocks, but what I meant was to remove the usage of the blockSize variable altogether, as equal-sized blocks do not hold for all FileSystems. We want to generalize the logic for FileSystems other than HDFS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129257#comment-14129257 ] Pankit Thapar commented on HIVE-8038: - We have a custom FileSystem implementation over S3. Our block allocation logic is a little different from HDFS's. So, I will go ahead and look at the failed test and try to fix it. Do you have comments on the code change? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8038) Decouple ORC files split calculation logic from Filesystem's get file location implementation
[ https://issues.apache.org/jira/browse/HIVE-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129266#comment-14129266 ] Pankit Thapar commented on HIVE-8038: - org.apache.hive.hcatalog.pig.TestOrcHCatLoader.testReadDataPrimitiveTypes fails even without the patch I submitted. Can someone please confirm that? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-7251) Fix StorageDescriptor usage in unit tests
Pankit Thapar created HIVE-7251: --- Summary: Fix StorageDescriptor usage in unit tests Key: HIVE-7251 URL: https://issues.apache.org/jira/browse/HIVE-7251 Project: Hive Issue Type: Bug Components: Tests Affects Versions: 0.13.1 Reporter: Pankit Thapar Priority: Minor

Current Approach:
The StorageDescriptor class is used to describe parameters like InputFormat, OutputFormat, SerDeInfo, etc. for a Hive table. Some of the class variables, like InputFormat, OutputFormat, and SerDeInfo.serializationLib, are required fields when creating a StorageDescriptor object. For example, the createTable command in the metastore client creates the table with the default values of such variables defined in HiveConf or hive-default.xml. But in unit tests, a table is created in a slightly different way, so these values need to be set explicitly. Thus, when creating tables in tests, the required fields of the StorageDescriptor object need to be set.

Issue with current approach:
From some of the current usages of this class in unit tests, I noticed that when a test case tries to clean up the database and finds a table created by a previously executed test case, the cleanup process fetches the Table object and performs sanity checks, which include checking required fields like InputFormat, OutputFormat, and SerDeInfo.serializationLib of the table. The sanity checks fail, which results in failure of the test case.

Fix:
In unit tests, the StorageDescriptor object should be created with the fields that are sanity-checked when the table is fetched.

NOTE: This fix fixes 6 test cases in itests/hive-unit/.
-- This message was sent by Atlassian JIRA (v6.2#6252)
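An illustrative sketch of the idea (the column, formats, and SerDe class below are placeholder choices, not necessarily what the patch uses): populate the StorageDescriptor fields that the metastore sanity-checks before handing the table back to a test.

//code sketch (illustrative, not from HIVE-7251.patch) :
import java.util.Arrays;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.SerDeInfo;
import org.apache.hadoop.hive.metastore.api.StorageDescriptor;

public final class TestStorageDescriptors {
  // Builds a StorageDescriptor with the required fields set explicitly,
  // mirroring the defaults the metastore would otherwise pull from HiveConf.
  public static StorageDescriptor newSanityCheckedSd() {
    StorageDescriptor sd = new StorageDescriptor();
    sd.setCols(Arrays.asList(new FieldSchema("id", "int", "")));                // placeholder column
    sd.setInputFormat("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat");       // placeholder format
    sd.setOutputFormat("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat");     // placeholder format
    SerDeInfo serde = new SerDeInfo();
    serde.setSerializationLib("org.apache.hadoop.hive.ql.io.orc.OrcSerde");     // placeholder SerDe
    sd.setSerdeInfo(serde);
    return sd;
  }
}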
[jira] [Updated] (HIVE-7251) Fix StorageDescriptor usage in unit tests
[ https://issues.apache.org/jira/browse/HIVE-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7251: Attachment: HIVE-7251.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7251) Fix StorageDescriptor usage in unit tests
[ https://issues.apache.org/jira/browse/HIVE-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7251: Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Status: Patch Available (was: Open)

Fix TestHiveConf#testConfProperties test case
Key: HIVE-7201 URL: https://issues.apache.org/jira/browse/HIVE-7201 Project: Hive Issue Type: Bug Components: Tests Affects Versions: 0.13.0 Reporter: Pankit Thapar Assignee: Pankit Thapar Priority: Minor Attachments: HIVE-7201-1.patch, HIVE-7201-2.patch, HIVE-7201.03.patch, HIVE-7201.04.patch, HIVE-7201.patch

CHANGE 1:
TEST CASE: The intention of TestHiveConf#testConfProperties() is to test that HiveConf properties are set with the expected priority. Each HiveConf object is initialized as follows:
1) Hadoop configuration properties are applied.
2) ConfVar properties with non-null values are overlaid.
3) hive-site.xml properties are overlaid.

ISSUE: The mapreduce-related configurations are loaded by JobConf, not Configuration. The current test tries to get configuration properties like HADOOPNUMREDUCERS (mapred.job.reduces) from the Configuration class, but these mapreduce-related properties are loaded by the JobConf class from mapred-default.xml.

DETAILS:
LINE 63: checkHadoopConf(ConfVars.HADOOPNUMREDUCERS.varname, 1); -- fails because, in
private void checkHadoopConf(String name, String expectedHadoopVal) {
  Assert.assertEquals(expectedHadoopVal, new Configuration().get(name));
}
the second parameter is null, since it is the JobConf class, not the Configuration class, that initializes the mapred-default values.
The code that loads the mapreduce resources is in ConfigUtil, and JobConf calls it like this (in a static block):
public class JobConf extends Configuration {
  private static final Log LOG = LogFactory.getLog(JobConf.class);
  static {
    ConfigUtil.loadResources(); // loads mapreduce-related resources (mapred-default.xml)
  }
  ...
}
Please note that the test case assertion works fine if the HiveConf() constructor is called before this assertion, since HiveConf() triggers JobConf(), which sets the default values of the properties pertaining to mapreduce. This is why there won't be any failures if testHiveSitePath() is run before testConfProperties(), as that would load the mapreduce properties into the config properties.

FIX: Instead of using a Configuration object, we can use a JobConf object to get the default values used by hadoop/mapreduce.

CHANGE 2: In TestHiveConf#testHiveSitePath(), the static method getHiveSiteLocation() should be called statically instead of through an object.
-- This message was sent by Atlassian JIRA (v6.2#6252)
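A small sketch of the observation above (the property name comes from the description; what get() actually returns also depends on which *-default.xml files are on the test classpath, so treat the printed values as typical rather than guaranteed):

//code sketch (illustrative) :
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public final class MapredDefaultsCheck {
  public static void main(String[] args) {
    // Configuration alone does not trigger ConfigUtil.loadResources(), so the
    // mapred-default.xml value is typically not visible here (may print null).
    System.out.println(new Configuration().get("mapred.job.reduces"));

    // JobConf's static initializer loads the mapreduce resources, so the
    // shipped default (typically "1") is visible here.
    System.out.println(new JobConf().get("mapred.job.reduces"));
  }
}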
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Attachment: HIVE-7201.04.patch Updated patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Affects Version/s: (was: 0.13.0) 0.13.1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HIVE-7228) StreamPrinter should be joined to calling thread
Pankit Thapar created HIVE-7228: --- Summary: StreamPrinter should be joined to calling thread Key: HIVE-7228 URL: https://issues.apache.org/jira/browse/HIVE-7228 Project: Hive Issue Type: Bug Components: CLI Affects Versions: 0.13.0 Reporter: Pankit Thapar Priority: Minor

ISSUE: The StreamPrinter class is used to connect an input stream (connected to a process's output) with the output stream of a session (CliSessionState/SessionState). It acts as a pipe between the two and transfers data from the input stream to the output stream. THE TRANSFER OPERATION RUNS IN A SEPARATE THREAD. From some of the current usages of this class, I noticed that the calling threads do not wait for the transfer operation to complete; that is, the calling thread does not join the StreamPrinter threads. The calling thread moves forward assuming that the respective output stream already has the data it needs. But that is not always a safe assumption, since the StreamPrinter thread may not have finished execution by the time the calling thread expects it to.

FIX: To ensure that the calling thread waits for the StreamPrinter threads to complete, the StreamPrinter threads are joined to the calling thread.

Please note, without the fix, TestCliDriverMethods#testRun failed sometimes (roughly 1 in 30 runs). This test does not fail with the fix.
-- This message was sent by Atlassian JIRA (v6.2#6252)
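A generic Java sketch of the fix pattern, not the actual StreamPrinter class or the patch (the command and stream handling below are placeholders): the thread that drains the child process's output must be joined before the caller assumes the output stream is complete.

//code sketch (illustrative, not the StreamPrinter API) :
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public final class JoinPipeThread {
  public static void main(String[] args) throws IOException, InterruptedException {
    Process proc = new ProcessBuilder("echo", "hello").start();   // placeholder command

    // Stand-in for StreamPrinter: copy the process output on a separate thread.
    Thread printer = new Thread(() -> {
      try (BufferedReader reader =
               new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);   // stands in for writing to the session's output stream
        }
      } catch (IOException ignored) {
        // a real implementation would log this
      }
    });

    printer.start();
    proc.waitFor();
    printer.join();   // without join(), the caller may read an incomplete output stream
  }
}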
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7228) StreamPrinter should be joined to calling thread
[ https://issues.apache.org/jira/browse/HIVE-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7228: Attachment: HIVE-7228.patch Added join() to usages of StreamPrinter -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7228) StreamPrinter should be joined to calling thread
[ https://issues.apache.org/jira/browse/HIVE-7228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7228: Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Attachment: HIVE-7201-2.patch This is the correct patch; the previous one was rebased to trunk, while this one is rebased to the latest branch-0.13. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Attachment: HIVE-7201.03.patch Renamed the patch to trigger the autobuild. Rebased the patch to trunk instead of branch-0.13. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Attachment: HIVE-7201-1.patch The patch is rebased against the latest trunk for branch-0.13 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Status: Patch Available (was: Open)
The patch is rebased against the latest trunk for branch-13.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
Pankit Thapar created HIVE-7201: Summary: Fix TestHiveConf#testConfProperties test case Key: HIVE-7201 URL: https://issues.apache.org/jira/browse/HIVE-7201 Project: Hive Issue Type: Bug Components: Tests Affects Versions: 0.13.0 Reporter: Pankit Thapar Priority: Minor
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Status: Patch Available (was: Open)
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HIVE-7201) Fix TestHiveConf#testConfProperties test case
[ https://issues.apache.org/jira/browse/HIVE-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pankit Thapar updated HIVE-7201: Attachment: HIVE-7201.patch
-- This message was sent by Atlassian JIRA (v6.2#6252)