[jira] Updated: (HIVE-1674) count(*) returns wrong result when a mapper returns empty results
[ https://issues.apache.org/jira/browse/HIVE-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1674: --- Resolution: Fixed Fix Version/s: 0.7.0 Status: Resolved (was: Patch Available) I just committed! Thanks Ning! count(*) returns wrong result when a mapper returns empty results - Key: HIVE-1674 URL: https://issues.apache.org/jira/browse/HIVE-1674 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Ning Zhang Fix For: 0.7.0 Attachments: HIVE-1674.patch select count(*) from src where false; will return # of mappers rather than 0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1376) Simple UDAFs with more than 1 parameter crash on empty row query
[ https://issues.apache.org/jira/browse/HIVE-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918716#action_12918716 ] He Yongqiang commented on HIVE-1376: will take a look. Simple UDAFs with more than 1 parameter crash on empty row query - Key: HIVE-1376 URL: https://issues.apache.org/jira/browse/HIVE-1376 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.6.0 Reporter: Mayank Lahiri Assignee: Ning Zhang Attachments: HIVE-1376.2.patch, HIVE-1376.patch Simple UDAFs with more than 1 parameter crash when the query returns no rows. Currently, this only seems to affect the percentile() UDAF where the second parameter is the percentile to be computed (of type double). I've also verified the bug by adding a dummy parameter to ExampleMin in contrib. On an empty query, Hive seems to be trying to resolve an iterate() method with signature {null,null} instead of {null,double}. You can reproduce this bug using: CREATE TABLE pct_test ( val INT ); SELECT percentile(val, 0.5) FROM pct_test; which produces a lot of errors like: Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public boolean org.apache.hadoop.hive.ql.udf.UDAFPercentile$PercentileLongEvaluator.iterate(org.apache.hadoop.io.LongWritable,double) on object org.apache.hadoop.hive.ql.udf.udafpercentile$percentilelongevalua...@11d13272 of class org.apache.hadoop.hive.ql.udf.UDAFPercentile$PercentileLongEvaluator with arguments {null, null} of size 2 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
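The failure mode in the stack trace above can be demonstrated in isolation. The sketch below uses invented stand-in names (NullArgDemo, its iterate()), not Hive's actual resolver: with zero rows every argument is null, and reflective invocation cannot pass null where the method declares a primitive double.

```java
import java.lang.reflect.Method;

// Standalone illustration of the {null, null} failure described above.
// NullArgDemo and iterate() are invented stand-ins, not Hive classes;
// the signature mirrors PercentileLongEvaluator.iterate(LongWritable, double).
public class NullArgDemo {
    public boolean iterate(Long value, double percentile) {
        return true;
    }

    public static void main(String[] args) throws Exception {
        Method m = NullArgDemo.class.getMethod("iterate", Long.class, double.class);
        try {
            // On an empty query the evaluator is handed {null, null} -- but
            // null cannot be unboxed into the primitive double parameter.
            m.invoke(new NullArgDemo(), new Object[] { null, null });
            System.out.println("invoked");
        } catch (IllegalArgumentException e) {
            System.out.println("failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Run directly, this prints the "failed" branch, matching the HiveException wrapping seen in the report.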
[jira] Commented: (HIVE-1641) add map joined table to distributed cache
[ https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918764#action_12918764 ] He Yongqiang commented on HIVE-1641: There are 2 patches with the same name. Can you delete the older one? And when uploading a patch, please rename the patch to hive-<jira number>.<patch number or date>.patch. add map joined table to distributed cache - Key: HIVE-1641 URL: https://issues.apache.org/jira/browse/HIVE-1641 Project: Hadoop Hive Issue Type: Improvement Components: Query Processor Affects Versions: 0.7.0 Reporter: Namit Jain Assignee: Liyin Tang Fix For: 0.7.0 Attachments: Hive-1641.patch, Hive-1641.patch Currently, the mappers directly read the map-joined table from HDFS, which makes it difficult to scale. We end up getting lots of timeouts once the number of mappers is beyond a few thousand, due to concurrent mappers. It would be a good idea to put the mapped file into the distributed cache and read from there instead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1376) Simple UDAFs with more than 1 parameter crash on empty row query
[ https://issues.apache.org/jira/browse/HIVE-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918767#action_12918767 ] He Yongqiang commented on HIVE-1376: The patch looks good. Is there the same problem in other UDAFs? If yes, should we fix them one by one, or fix them in the group-by operator? Simple UDAFs with more than 1 parameter crash on empty row query - Key: HIVE-1376 URL: https://issues.apache.org/jira/browse/HIVE-1376 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.6.0 Reporter: Mayank Lahiri Assignee: Ning Zhang Attachments: HIVE-1376.2.patch, HIVE-1376.patch Simple UDAFs with more than 1 parameter crash when the query returns no rows. Currently, this only seems to affect the percentile() UDAF where the second parameter is the percentile to be computed (of type double). I've also verified the bug by adding a dummy parameter to ExampleMin in contrib. On an empty query, Hive seems to be trying to resolve an iterate() method with signature {null,null} instead of {null,double}. You can reproduce this bug using: CREATE TABLE pct_test ( val INT ); SELECT percentile(val, 0.5) FROM pct_test; which produces a lot of errors like: Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public boolean org.apache.hadoop.hive.ql.udf.UDAFPercentile$PercentileLongEvaluator.iterate(org.apache.hadoop.io.LongWritable,double) on object org.apache.hadoop.hive.ql.udf.udafpercentile$percentilelongevalua...@11d13272 of class org.apache.hadoop.hive.ql.udf.UDAFPercentile$PercentileLongEvaluator with arguments {null, null} of size 2 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1376) Simple UDAFs with more than 1 parameter crash on empty row query
[ https://issues.apache.org/jira/browse/HIVE-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918771#action_12918771 ] He Yongqiang commented on HIVE-1376: sorry, did not see the previous comments. John and Zheng have already discussed this problem. I will start running tests. Simple UDAFs with more than 1 parameter crash on empty row query - Key: HIVE-1376 URL: https://issues.apache.org/jira/browse/HIVE-1376 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.6.0 Reporter: Mayank Lahiri Assignee: Ning Zhang Attachments: HIVE-1376.2.patch, HIVE-1376.patch Simple UDAFs with more than 1 parameter crash when the query returns no rows. Currently, this only seems to affect the percentile() UDAF where the second parameter is the percentile to be computed (of type double). I've also verified the bug by adding a dummy parameter to ExampleMin in contrib. On an empty query, Hive seems to be trying to resolve an iterate() method with signature {null,null} instead of {null,double}. You can reproduce this bug using: CREATE TABLE pct_test ( val INT ); SELECT percentile(val, 0.5) FROM pct_test; which produces a lot of errors like: Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public boolean org.apache.hadoop.hive.ql.udf.UDAFPercentile$PercentileLongEvaluator.iterate(org.apache.hadoop.io.LongWritable,double) on object org.apache.hadoop.hive.ql.udf.udafpercentile$percentilelongevalua...@11d13272 of class org.apache.hadoop.hive.ql.udf.UDAFPercentile$PercentileLongEvaluator with arguments {null, null} of size 2 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1674) count(*) returns wrong result when a mapper returns empty results
[ https://issues.apache.org/jira/browse/HIVE-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12918318#action_12918318 ] He Yongqiang commented on HIVE-1674: +1. running test. count(*) returns wrong result when a mapper returns empty results - Key: HIVE-1674 URL: https://issues.apache.org/jira/browse/HIVE-1674 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Ning Zhang Attachments: HIVE-1674.patch select count(*) from src where false; will return # of mappers rather than 0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1658) Fix describe [extended] column formatting
[ https://issues.apache.org/jira/browse/HIVE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917725#action_12917725 ] He Yongqiang commented on HIVE-1658: +1. Looks good. Can you do the final patch? Fix describe [extended] column formatting - Key: HIVE-1658 URL: https://issues.apache.org/jira/browse/HIVE-1658 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: Paul Yang Assignee: Thiruvel Thirumoolan Attachments: HIVE-1658-PrelimPatch.patch When displaying the column schema, the formatting should be name<TAB>type<TAB>comment<NEWLINE>, in line with the previous formatting style for backward compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1658) Fix describe [extended] column formatting
[ https://issues.apache.org/jira/browse/HIVE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917771#action_12917771 ] He Yongqiang commented on HIVE-1658: One more thing: if the time information (create time, last access time, etc.) is 0, can you put some string like 'unknown' in the output of desc format? Fix describe [extended] column formatting - Key: HIVE-1658 URL: https://issues.apache.org/jira/browse/HIVE-1658 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: Paul Yang Assignee: Thiruvel Thirumoolan Attachments: HIVE-1658-PrelimPatch.patch When displaying the column schema, the formatting should be name<TAB>type<TAB>comment<NEWLINE>, in line with the previous formatting style for backward compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1674) count(*) returns wrong result when a mapper returns empty results
[ https://issues.apache.org/jira/browse/HIVE-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12917856#action_12917856 ] He Yongqiang commented on HIVE-1674: will take a look. count(*) returns wrong result when a mapper returns empty results - Key: HIVE-1674 URL: https://issues.apache.org/jira/browse/HIVE-1674 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Ning Zhang Attachments: HIVE-1674.patch select count(*) from src where false; will return # of mappers rather than 0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1676) show table extended like does not work well with wildcards
[ https://issues.apache.org/jira/browse/HIVE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916583#action_12916583 ] He Yongqiang commented on HIVE-1676: needs to use backquotes (``), not quotes. show table extended like does not work well with wildcards -- Key: HIVE-1676 URL: https://issues.apache.org/jira/browse/HIVE-1676 Project: Hadoop Hive Issue Type: Bug Reporter: Pradeep Kamath Priority: Minor As evident from the output below, though there are tables that match the wildcard, the output from show table extended like does not contain the matches. {noformat} bin/hive -e "show tables 'foo*'" Hive history file=/tmp/pradeepk/hive_job_log_pradeepk_201009301037_568707409.txt OK foo foo2 Time taken: 3.417 seconds bin/hive -e "show table extended like 'foo*'" Hive history file=/tmp/pradeepk/hive_job_log_pradeepk_201009301037_410056681.txt OK Time taken: 2.948 seconds {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1673) Create table bug causes the row format property lost when serde is specified.
[ https://issues.apache.org/jira/browse/HIVE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916590#action_12916590 ] He Yongqiang commented on HIVE-1673: Just tried again; the same tests succeeded on my box. Can you post your diff for those testcases? Create table bug causes the row format property lost when serde is specified. - Key: HIVE-1673 URL: https://issues.apache.org/jira/browse/HIVE-1673 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: He Yongqiang Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1673.1.patch An example: create table src_rc_serde_yongqiang(key string, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\0' stored as rcfile; will lose the row format information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1647) Incorrect initialization of thread local variable inside IOContext ( implementation is not threadsafe )
[ https://issues.apache.org/jira/browse/HIVE-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916676#action_12916676 ] He Yongqiang commented on HIVE-1647: +1 running tests Incorrect initialization of thread local variable inside IOContext ( implementation is not threadsafe ) Key: HIVE-1647 URL: https://issues.apache.org/jira/browse/HIVE-1647 Project: Hadoop Hive Issue Type: Bug Components: Server Infrastructure Affects Versions: 0.6.0, 0.7.0 Reporter: Raman Grover Assignee: Liyin Tang Fix For: 0.7.0 Attachments: HIVE-1647.patch Original Estimate: 0.17h Remaining Estimate: 0.17h Bug in org.apache.hadoop.hive.ql.io.IOContext in relation to initialization of thread local variable. public class IOContext { private static ThreadLocal<IOContext> threadLocal = new ThreadLocal<IOContext>(){ }; static { if (threadLocal.get() == null) { threadLocal.set(new IOContext()); } } } In a multi-threaded environment, the thread that gets to load the class first for the JVM (assuming threads share the classloader), gets to initialize itself correctly by executing the code in the static block. Once the class is loaded, any subsequent threads would have their respective threadlocal variable as null. Since IOContext is set during initialization of HiveRecordReader, in a scenario where multiple threads get to acquire an instance of HiveRecordReader, it would result in an NPE for all but the first thread that gets to load the class in the VM. Is the above scenario of multiple threads initializing HiveRecordReader a typical one? Or we could just provide the following fix... private static ThreadLocal<IOContext> threadLocal = new ThreadLocal<IOContext>() { protected synchronized IOContext initialValue() { return new IOContext(); } }; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
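The proposed initialValue() fix can be sketched in a standalone demo (IOContext here is a stub, not the real Hive class): overriding initialValue() gives every thread its own lazily created instance, whereas the static-block version only initializes the thread that happens to load the class.

```java
// Sketch of the initialValue() fix proposed above, using a stub IOContext.
// Every thread that calls get() lazily receives its own instance, so threads
// other than the class-loading one no longer observe null.
public class IOContextDemo {
    static class IOContext { }

    private static final ThreadLocal<IOContext> threadLocal =
        new ThreadLocal<IOContext>() {
            @Override
            protected IOContext initialValue() {
                return new IOContext();
            }
        };

    public static void main(String[] args) throws InterruptedException {
        final IOContext[] seen = new IOContext[1];
        Thread t = new Thread(new Runnable() {
            public void run() {
                seen[0] = threadLocal.get();   // triggers initialValue() here
            }
        });
        t.start();
        t.join();
        // Both the main thread and the spawned thread get non-null contexts;
        // with the static-block version, seen[0] would be null.
        System.out.println(threadLocal.get() != null && seen[0] != null);
    }
}
```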
[jira] Resolved: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang resolved HIVE-1624. Fix Version/s: 0.7.0 Resolution: Fixed I just committed! Thanks Vaibhav Aggarwal! Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Assignee: Vaibhav Aggarwal Fix For: 0.7.0 Attachments: HIVE-1624-2.patch, HIVE-1624-3.patch, HIVE-1624-4.patch, HIVE-1624-5.patch, HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1647) Incorrect initialization of thread local variable inside IOContext ( implementation is not threadsafe )
[ https://issues.apache.org/jira/browse/HIVE-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916712#action_12916712 ] He Yongqiang commented on HIVE-1647: The test is still running, but there are a lot of diffs. Can you take a look? Examples: join_map_ppr.q, input_part10.q. You can use 'ant test -Dtestcase=TestCliDriver -Dqfile=join_map_ppr.q,input_part10.q' to reproduce. Incorrect initialization of thread local variable inside IOContext ( implementation is not threadsafe ) Key: HIVE-1647 URL: https://issues.apache.org/jira/browse/HIVE-1647 Project: Hadoop Hive Issue Type: Bug Components: Server Infrastructure Affects Versions: 0.6.0, 0.7.0 Reporter: Raman Grover Assignee: Liyin Tang Fix For: 0.7.0 Attachments: HIVE-1647.patch Original Estimate: 0.17h Remaining Estimate: 0.17h Bug in org.apache.hadoop.hive.ql.io.IOContext in relation to initialization of thread local variable. public class IOContext { private static ThreadLocal<IOContext> threadLocal = new ThreadLocal<IOContext>(){ }; static { if (threadLocal.get() == null) { threadLocal.set(new IOContext()); } } } In a multi-threaded environment, the thread that gets to load the class first for the JVM (assuming threads share the classloader), gets to initialize itself correctly by executing the code in the static block. Once the class is loaded, any subsequent threads would have their respective threadlocal variable as null. Since IOContext is set during initialization of HiveRecordReader, in a scenario where multiple threads get to acquire an instance of HiveRecordReader, it would result in an NPE for all but the first thread that gets to load the class in the VM. Is the above scenario of multiple threads initializing HiveRecordReader a typical one? Or we could just provide the following fix...
private static ThreadLocal<IOContext> threadLocal = new ThreadLocal<IOContext>() { protected synchronized IOContext initialValue() { return new IOContext(); } }; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1677) revert changes made by HIVE-558
revert changes made by HIVE-558 --- Key: HIVE-1677 URL: https://issues.apache.org/jira/browse/HIVE-1677 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang There is another jira (https://issues.apache.org/jira/browse/HIVE-1658) going on to do a better fix for HIVE-558. If HIVE-1658 cannot be patched in a timely fashion, we can revert HIVE-558 for now so that it will not be a blocker for the release. This is just the bottom line. Please feel free to close this jira. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-1677) revert changes made by HIVE-558
[ https://issues.apache.org/jira/browse/HIVE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang resolved HIVE-1677. Resolution: Invalid Had an offline discussion with Namit: reverting would need a lot of changes in the log file, so it is not a good way. Let's first do a simple fix in HIVE-1658, and then do the pretty describe in another diff. revert changes made by HIVE-558 --- Key: HIVE-1677 URL: https://issues.apache.org/jira/browse/HIVE-1677 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang There is another jira (https://issues.apache.org/jira/browse/HIVE-1658) going on to do a better fix for HIVE-558. If HIVE-1658 cannot be patched in a timely fashion, we can revert HIVE-558 for now so that it will not be a blocker for the release. This is just the bottom line. Please feel free to close this jira. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1665) drop operations may cause file leak
[ https://issues.apache.org/jira/browse/HIVE-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1665: --- Attachment: hive-1665.1.patch drop operations may cause file leak --- Key: HIVE-1665 URL: https://issues.apache.org/jira/browse/HIVE-1665 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1665.1.patch Right now when doing a drop, Hive first drops metadata and then drops the actual files. If the file system is down at that time, the files will never get deleted. Had an offline discussion about this: to fix this, add a new conf scratch dir to the hive conf. When doing a drop operation: 1) move data to the scratch directory; 2) drop metadata; 3.1) if 2) failed, roll back 1) and report an error; 3.2) if 2) succeeded, drop the data from the scratch directory; 4) if 3.2) fails, we are OK because we assume the scratch dir will be emptied manually. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
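The four steps of the proposed drop protocol can be sketched as follows. This is a hypothetical illustration with invented names, using the local filesystem as a stand-in for HDFS; it is not Hive's actual implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the drop protocol described above; step comments match the
// numbering in the issue. All names are invented for illustration.
public class DropProtocolSketch {

    /** Returns true only if both the metadata drop and the data purge ran. */
    static boolean dropTable(Path data, Path scratch, Runnable dropMetadata) {
        try {
            Files.move(data, scratch);            // 1) move data to scratch dir
        } catch (IOException e) {
            return false;                         // nothing has changed yet
        }
        try {
            dropMetadata.run();                   // 2) drop metadata
        } catch (RuntimeException e) {
            try {
                Files.move(scratch, data);        // 3.1) roll back the move
            } catch (IOException ignored) {
                // data is stranded in scratch; report and rely on manual cleanup
            }
            return false;
        }
        try {
            Files.delete(scratch);                // 3.2) purge the moved data
        } catch (IOException ignored) {
            // 4) acceptable: the scratch dir is emptied manually later
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("drop-sketch");
        Path data = Files.createFile(dir.resolve("table_file"));
        Path scratch = dir.resolve("scratch_file");

        // Happy path: the metadata drop succeeds, so the data is purged too.
        boolean ok = dropTable(data, scratch, new Runnable() {
            public void run() { /* metadata dropped */ }
        });
        System.out.println(ok && !Files.exists(data) && !Files.exists(scratch));
    }
}
```

The key property is that metadata is only dropped after the data has been safely moved aside, so a filesystem failure can no longer leave files without metadata pointing at them.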
[jira] Created: (HIVE-1673) Create table bug causes the row format property lost when serde is specified.
Create table bug causes the row format property lost when serde is specified. - Key: HIVE-1673 URL: https://issues.apache.org/jira/browse/HIVE-1673 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang An example: create table src_rc_serde_yongqiang(key string, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\0' stored as rcfile; will lose the row format information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1673) Create table bug causes the row format property lost when serde is specified.
[ https://issues.apache.org/jira/browse/HIVE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1673: --- Status: Patch Available (was: Open) Affects Version/s: 0.7.0 Fix Version/s: 0.7.0 Create table bug causes the row format property lost when serde is specified. - Key: HIVE-1673 URL: https://issues.apache.org/jira/browse/HIVE-1673 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: He Yongqiang Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1673.1.patch An example: create table src_rc_serde_yongqiang(key string, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\0' stored as rcfile; will lose the row format information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1673) Create table bug causes the row format property lost when serde is specified.
[ https://issues.apache.org/jira/browse/HIVE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1673: -- Assignee: He Yongqiang Create table bug causes the row format property lost when serde is specified. - Key: HIVE-1673 URL: https://issues.apache.org/jira/browse/HIVE-1673 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: He Yongqiang Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1673.1.patch An example: create table src_rc_serde_yongqiang(key string, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\0' stored as rcfile; will lose the row format information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1673) Create table bug causes the row format property lost when serde is specified.
[ https://issues.apache.org/jira/browse/HIVE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1673: --- Attachment: hive-1673.1.patch Create table bug causes the row format property lost when serde is specified. - Key: HIVE-1673 URL: https://issues.apache.org/jira/browse/HIVE-1673 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: He Yongqiang Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1673.1.patch An example: create table src_rc_serde_yongqiang(key string, value string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\0' stored as rcfile; will lose the row format information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1665) drop operations may cause file leak
[ https://issues.apache.org/jira/browse/HIVE-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12916329#action_12916329 ] He Yongqiang commented on HIVE-1665: If 2) failed and rolling back 1) also failed, then the data is in the trash/scratch dir and the table's metadata is still there. But 2) failing and the rollback of 1) also failing will rarely happen. The main concern here is dealing with HDFS being down and with housekeeping operations. For 'mark-then-delete', I think the main problem is that there is no administration daemon process or helper script for it. drop operations may cause file leak --- Key: HIVE-1665 URL: https://issues.apache.org/jira/browse/HIVE-1665 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1665.1.patch Right now when doing a drop, Hive first drops metadata and then drops the actual files. If the file system is down at that time, the files will never get deleted. Had an offline discussion about this: to fix this, add a new conf scratch dir to the hive conf. When doing a drop operation: 1) move data to the scratch directory; 2) drop metadata; 3.1) if 2) failed, roll back 1) and report an error; 3.2) if 2) succeeded, drop the data from the scratch directory; 4) if 3.2) fails, we are OK because we assume the scratch dir will be emptied manually. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915942#action_12915942 ] He Yongqiang commented on HIVE-1624: Mostly looks good. In your testcase, can you put the new script file in new Path(System.getProperty("test.data.dir", ".") + <file name>)? By moving fetchFilesNotInLocalFilesystem to SessionState, you can keep getScriptProgName() etc. in SemanticAnalyzer by changing fetchFilesNotInLocalFilesystem's arguments to pass in the command etc. I am also OK with the current way. Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Assignee: Vaibhav Aggarwal Attachments: HIVE-1624-2.patch, HIVE-1624-3.patch, HIVE-1624-4.patch, HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915391#action_12915391 ] He Yongqiang commented on HIVE-1624: Great. Some nitpicks, sorry for not posting them in the previous comment. 1) It seems there is still one leftover logging line: + getConsole().printInfo("Testing " + value); 2) Also, can you add one junit test for DosToUnix? 3) Do you think it may be better to move fetchFilesNotInLocalFilesystem to SessionState? Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Assignee: Vaibhav Aggarwal Attachments: HIVE-1624-2.patch, HIVE-1624-3.patch, HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
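For point 2), the kind of junit check being asked for might look like the sketch below. dosToUnix() here is a stand-in re-implementation of the CRLF-to-LF conversion for illustration only; it is not the actual Hive DosToUnix utility.

```java
// Illustration of the unit test requested for DosToUnix. The dosToUnix()
// method is an invented stand-in, not the real Hive class.
public class DosToUnixSketch {
    static String dosToUnix(String s) {
        return s.replace("\r\n", "\n");   // CRLF -> LF
    }

    public static void main(String[] args) {
        // A DOS-style script is converted; a Unix-style one is left untouched.
        boolean ok = dosToUnix("#!/bin/sh\r\necho hi\r\n").equals("#!/bin/sh\necho hi\n")
                  && dosToUnix("already\nunix\n").equals("already\nunix\n");
        System.out.println(ok);
    }
}
```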
[jira] Updated: (HIVE-1361) table/partition level statistics
[ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1361: --- Status: Resolved (was: Patch Available) Resolution: Fixed I just committed! Thanks Ning and Ahmed! table/partition level statistics Key: HIVE-1361 URL: https://issues.apache.org/jira/browse/HIVE-1361 Project: Hadoop Hive Issue Type: Sub-task Components: Query Processor Reporter: Ning Zhang Assignee: Ahmed M Aly Fix For: 0.7.0 Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.4.java_only.patch, HIVE-1361.4.patch, HIVE-1361.5.java_only.patch, HIVE-1361.5.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch At the first step, we gather table-level stats for non-partitioned table and partition-level stats for partitioned table. Future work could extend the table level stats to partitioned table as well. There are 3 major milestones in this subtask: 1) extend the insert statement to gather table/partition level stats on-the-fly. 2) extend metastore API to support storing and retrieving stats for a particular table/partition. 3) add an ANALYZE TABLE [PARTITION] statement in Hive QL to gather stats for existing tables/partitions. The proposed stats are: Partition-level stats: - number of rows - total size in bytes - number of files - max, min, average row sizes - max, min, average file sizes Table-level stats in addition to partition level stats: - number of partitions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-1663) ql/src/java/org/apache/hadoop/hive/ql/parse/SamplePruner.java is empty
[ https://issues.apache.org/jira/browse/HIVE-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang resolved HIVE-1663. Fix Version/s: 0.7.0 Resolution: Fixed fixed. ql/src/java/org/apache/hadoop/hive/ql/parse/SamplePruner.java is empty -- Key: HIVE-1663 URL: https://issues.apache.org/jira/browse/HIVE-1663 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Fix For: 0.7.0 we should remove this empty file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1663) ql/src/java/org/apache/hadoop/hive/ql/parse/SamplePruner.java is empty
[ https://issues.apache.org/jira/browse/HIVE-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915435#action_12915435 ] He Yongqiang commented on HIVE-1663: Sorry that I committed this myself; I cannot generate a patch for it (the file is empty). ql/src/java/org/apache/hadoop/hive/ql/parse/SamplePruner.java is empty -- Key: HIVE-1663 URL: https://issues.apache.org/jira/browse/HIVE-1663 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Fix For: 0.7.0 we should remove this empty file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1670) MapJoin throws EOFExeption when the mapjoined table has 0 column selected
[ https://issues.apache.org/jira/browse/HIVE-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914762#action_12914762 ] He Yongqiang commented on HIVE-1670: Is this the same as https://issues.apache.org/jira/browse/HIVE-1452? If yes, we can close HIVE-1452 since it is fixed here. MapJoin throws EOFExeption when the mapjoined table has 0 column selected - Key: HIVE-1670 URL: https://issues.apache.org/jira/browse/HIVE-1670 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Ning Zhang Attachments: HIVE-1670.patch select /*+mapjoin(b) */ sum(a.key) from src a join src b on (a.key=b.key); throws EOFException -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1361) table/partition level statistics
[ https://issues.apache.org/jira/browse/HIVE-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914728#action_12914728 ] He Yongqiang commented on HIVE-1361: +1 running tests. table/partition level statistics Key: HIVE-1361 URL: https://issues.apache.org/jira/browse/HIVE-1361 Project: Hadoop Hive Issue Type: Sub-task Components: Query Processor Reporter: Ning Zhang Assignee: Ahmed M Aly Fix For: 0.7.0 Attachments: HIVE-1361.2.patch, HIVE-1361.2_java_only.patch, HIVE-1361.3.patch, HIVE-1361.4.java_only.patch, HIVE-1361.4.patch, HIVE-1361.5.java_only.patch, HIVE-1361.5.patch, HIVE-1361.java_only.patch, HIVE-1361.patch, stats0.patch At the first step, we gather table-level stats for non-partitioned table and partition-level stats for partitioned table. Future work could extend the table level stats to partitioned table as well. There are 3 major milestones in this subtask: 1) extend the insert statement to gather table/partition level stats on-the-fly. 2) extend metastore API to support storing and retrieving stats for a particular table/partition. 3) add an ANALYZE TABLE [PARTITION] statement in Hive QL to gather stats for existing tables/partitions. The proposed stats are: Partition-level stats: - number of rows - total size in bytes - number of files - max, min, average row sizes - max, min, average file sizes Table-level stats in addition to partition level stats: - number of partitions -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1669) non-deterministic display of storage parameter in test
[ https://issues.apache.org/jira/browse/HIVE-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914727#action_12914727 ] He Yongqiang commented on HIVE-1669: Ning, can you post a fix for this after I commit the statistics jira (HIVE-1361)? non-deterministic display of storage parameter in test -- Key: HIVE-1669 URL: https://issues.apache.org/jira/browse/HIVE-1669 Project: Hadoop Hive Issue Type: Test Reporter: Ning Zhang With the change to beautify 'desc extended table', the storage parameters are displayed in a non-deterministic order (since the underlying implementation is a HashMap). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1661) Default values for parameters
[ https://issues.apache.org/jira/browse/HIVE-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1661: --- Status: Resolved (was: Patch Available) Resolution: Fixed committed! Thanks Siying! Default values for parameters - Key: HIVE-1661 URL: https://issues.apache.org/jira/browse/HIVE-1661 Project: Hadoop Hive Issue Type: New Feature Components: Query Processor Reporter: Namit Jain Assignee: Siying Dong Fix For: 0.7.0 Attachments: HIVE-1661.1.patch, HIVE-1661.2.patch It would be good to have a default value for some hive parameters: say RETENTION to be 30 days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914176#action_12914176 ] He Yongqiang commented on HIVE-1624: Should I modify it to be an hdfs://anything || s3://anything like path? Yes, that will be a great start; we can add more if needed in the future. Also, please make sure that if a program that is on neither hdfs nor s3 cannot be found locally, the query does not fail in the semantic analyzer. Otherwise, it may break a lot of existing queries. Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Assignee: Vaibhav Aggarwal Attachments: HIVE-1624-2.patch, HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1665) drop operations may cause file leak
drop operations may cause file leak --- Key: HIVE-1665 URL: https://issues.apache.org/jira/browse/HIVE-1665 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Right now when doing a drop, Hive first drops the metadata and then drops the actual files. If the file system is down at that time, the files will never be deleted. Had an offline discussion about this: to fix it, add a new scratch-dir setting to the hive conf. When doing a drop operation: 1) move the data to the scratch directory; 2) drop the metadata; 3.1) if 2) failed, roll back 1) and report an error; 3.2) if 2) succeeded, drop the data from the scratch directory; 4) if 3.2) fails, we are still ok because we assume the scratch dir will be emptied manually. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
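The drop protocol proposed above can be sketched as follows. This is a minimal illustration only, not Hive's actual code: the class and method names (`SafeDrop`, `MetadataDropper`) are hypothetical, local `java.nio.file` operations stand in for HDFS calls, and the staged data is assumed to be an empty directory (a real implementation would delete recursively).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the drop protocol: stage the data in a scratch
// directory first, then drop the metadata, rolling the move back if the
// metadata drop fails.
public class SafeDrop {

    /** Stand-in for the metastore call; a real implementation would use the metastore client. */
    public interface MetadataDropper {
        void dropMetadata() throws IOException;
    }

    /**
     * Returns true if both data and metadata were dropped. On metadata
     * failure the data is restored and the exception is rethrown; on
     * staged-data deletion failure we return false and rely on the
     * scratch dir being emptied manually later.
     */
    public static boolean drop(Path dataDir, Path scratchDir, MetadataDropper metastore)
            throws IOException {
        Path staged = scratchDir.resolve(dataDir.getFileName());
        // 1) move the data to the scratch directory
        Files.move(dataDir, staged, StandardCopyOption.ATOMIC_MOVE);
        try {
            // 2) drop the metadata
            metastore.dropMetadata();
        } catch (IOException e) {
            // 3.1) metadata drop failed: roll back the move and report the error
            Files.move(staged, dataDir, StandardCopyOption.ATOMIC_MOVE);
            throw e;
        }
        try {
            // 3.2) metadata drop succeeded: delete the staged data
            Files.delete(staged);
        } catch (IOException e) {
            // 4) deletion failed: acceptable, scratch dir is cleaned manually
            return false;
        }
        return true;
    }
}
```

The key property is that the metadata drop happens only after the data is already staged, so a file-system failure at any single step leaves either a fully restored table or orphaned files confined to the scratch directory.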
[jira] Commented: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913878#action_12913878 ] He Yongqiang commented on HIVE-1624: Looks good basically; some unneeded logging information needs to be removed. One main problem here is determining when to download a file. We cannot simply try downloading a file whenever it cannot be found locally: sometimes scripts exist in a remote dir that the hadoop cluster nodes can access but the client cannot. Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Assignee: Vaibhav Aggarwal Attachments: HIVE-1624-2.patch, HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913895#action_12913895 ] He Yongqiang commented on HIVE-1624: For 2), it is actually sometimes a common case. For example, a user can use php without having the php program locally. We can add some simple rules for downloading resource files, such as (in this case) paths starting with the s3 scheme. Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Assignee: Vaibhav Aggarwal Attachments: HIVE-1624-2.patch, HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1633) CombineHiveInputFormat fails with cannot find dir for emptyFile
[ https://issues.apache.org/jira/browse/HIVE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913230#action_12913230 ] He Yongqiang commented on HIVE-1633: Amareshwari, by adding a testcase to TestHiveFileFormatUtils you should be able to find the underlying problem; can you then post a patch for it? CombineHiveInputFormat fails with cannot find dir for emptyFile - Key: HIVE-1633 URL: https://issues.apache.org/jira/browse/HIVE-1633 Project: Hadoop Hive Issue Type: Bug Components: Clients Reporter: Amareshwari Sriramadasu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1609) Support partition filtering in metastore
[ https://issues.apache.org/jira/browse/HIVE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913240#action_12913240 ] He Yongqiang commented on HIVE-1609: [By 'several partition functions' in my previous comment, I meant the existing partition functions.] So I just want to make sure the ones added in this jira will work fine for the python client. @john, please go ahead and commit this. This is a really good one to have; we can fix problems later if there are any. Support partition filtering in metastore Key: HIVE-1609 URL: https://issues.apache.org/jira/browse/HIVE-1609 Project: Hadoop Hive Issue Type: New Feature Components: Metastore Reporter: Ajay Kidave Assignee: Ajay Kidave Fix For: 0.7.0 Attachments: hive_1609.patch, hive_1609_2.patch, hive_1609_3.patch The metastore needs to have support for returning a list of partitions based on user-specified filter conditions. This will be useful for tools which need to do partition pruning. Howl is one such use case. The way partition pruning is done during hive query execution need not be changed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1662) Add file pruning into Hive.
Add file pruning into Hive. --- Key: HIVE-1662 URL: https://issues.apache.org/jira/browse/HIVE-1662 Project: Hadoop Hive Issue Type: New Feature Reporter: He Yongqiang Hive now supports a filename virtual column. If a file name filter is present in a query, Hive should be able to add only the files that pass the filter to the input paths. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1663) ql/src/java/org/apache/hadoop/hive/ql/parse/SamplePruner.java is empty
ql/src/java/org/apache/hadoop/hive/ql/parse/SamplePruner.java is empty -- Key: HIVE-1663 URL: https://issues.apache.org/jira/browse/HIVE-1663 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang we should remove this empty file -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1633) CombineHiveInputFormat fails with cannot find dir for emptyFile
[ https://issues.apache.org/jira/browse/HIVE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12912795#action_12912795 ] He Yongqiang commented on HIVE-1633: For a given path, CombineHiveInputFormat does a recursive lookup in partToPartitionInfo. If no match is found, it will look up the parent dir (hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1) in partToPartitionInfo. In your case, it seems the parent dir exists in partToPartitionInfo. CombineHiveInputFormat fails with cannot find dir for emptyFile - Key: HIVE-1633 URL: https://issues.apache.org/jira/browse/HIVE-1633 Project: Hadoop Hive Issue Type: Bug Components: Clients Reporter: Amareshwari Sriramadasu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
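The recursive lookup described in the comment above can be illustrated with a minimal sketch. This is not Hive's actual `HiveFileFormatUtils` code; the class and method names are hypothetical and the map values are plain strings standing in for PartitionDesc objects.

```java
import java.util.Map;

// Illustrative sketch of a recursive parent-directory lookup: try the path
// itself in the map, and on a miss retry with its parent directory until a
// match is found or there is no parent left.
public class PartitionLookup {

    /** Returns the value mapped for 'dir' or its nearest mapped ancestor, or null. */
    public static String findForPath(Map<String, String> pathToPartitionInfo, String dir) {
        String current = dir;
        while (current != null && !current.isEmpty()) {
            String match = pathToPartitionInfo.get(current);
            if (match != null) {
                return match;
            }
            // walk up one directory level; stop when no separator remains
            int slash = current.lastIndexOf('/');
            current = slash > 0 ? current.substring(0, slash) : null;
        }
        return null;
    }
}
```

Under this scheme, a split path like `.../-mr-10002/1/emptyFile` resolves via its parent `.../-mr-10002/1`, provided that parent is a key in the map; the "cannot find dir" error means neither the path nor any ancestor matched, e.g. because the URI authority ('xxx') differs between the split and the map keys.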
[jira] Commented: (HIVE-1650) TestContribNegativeCliDriver fails
[ https://issues.apache.org/jira/browse/HIVE-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910648#action_12910648 ] He Yongqiang commented on HIVE-1650: +1, running tests. TestContribNegativeCliDriver fails -- Key: HIVE-1650 URL: https://issues.apache.org/jira/browse/HIVE-1650 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.1650.1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1650) TestContribNegativeCliDriver fails
[ https://issues.apache.org/jira/browse/HIVE-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1650: --- Status: Resolved (was: Patch Available) Resolution: Fixed I just committed! Thanks Namit! TestContribNegativeCliDriver fails -- Key: HIVE-1650 URL: https://issues.apache.org/jira/browse/HIVE-1650 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.1650.1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1226) support filter pushdown against non-native tables
[ https://issues.apache.org/jira/browse/HIVE-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1226: --- Status: Resolved (was: Patch Available) Resolution: Fixed I just committed! Thanks John! support filter pushdown against non-native tables - Key: HIVE-1226 URL: https://issues.apache.org/jira/browse/HIVE-1226 Project: Hadoop Hive Issue Type: Improvement Components: HBase Handler, Query Processor Affects Versions: 0.6.0 Reporter: John Sichi Assignee: John Sichi Fix For: 0.7.0 Attachments: HIVE-1226.1.patch, HIVE-1226.2.patch, HIVE-1226.3.patch, HIVE-1226.4.patch For example, HBase's scan object can take filters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-1645) ability to specify parent directory for zookeeper lock manager
[ https://issues.apache.org/jira/browse/HIVE-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang resolved HIVE-1645. Resolution: Fixed I just committed! Thanks Namit! ability to specify parent directory for zookeeper lock manager -- Key: HIVE-1645 URL: https://issues.apache.org/jira/browse/HIVE-1645 Project: Hadoop Hive Issue Type: Improvement Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.1645.1.patch For concurrency support, it would be desirable if all the locks were created under a common parent, so that zookeeper can be used for different purposes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1633) CombineHiveInputFormat fails with cannot find dir for emptyFile
[ https://issues.apache.org/jira/browse/HIVE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910255#action_12910255 ] He Yongqiang commented on HIVE-1633: Can you search for hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1 (replacing xxx with the actual file/host names)? It should appear once in partToPartitionInfo and once more in hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1/emptyFile. CombineHiveInputFormat fails with cannot find dir for emptyFile - Key: HIVE-1633 URL: https://issues.apache.org/jira/browse/HIVE-1633 Project: Hadoop Hive Issue Type: Bug Components: Clients Reporter: Amareshwari Sriramadasu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1633) CombineHiveInputFormat fails with cannot find dir for emptyFile
[ https://issues.apache.org/jira/browse/HIVE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12910431#action_12910431 ] He Yongqiang commented on HIVE-1633: So the 'xxx' part is not the same in hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1/ and hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1/emptyFile ? CombineHiveInputFormat fails with cannot find dir for emptyFile - Key: HIVE-1633 URL: https://issues.apache.org/jira/browse/HIVE-1633 Project: Hadoop Hive Issue Type: Bug Components: Clients Reporter: Amareshwari Sriramadasu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1633) CombineHiveInputFormat fails with cannot find dir for emptyFile
[ https://issues.apache.org/jira/browse/HIVE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909711#action_12909711 ] He Yongqiang commented on HIVE-1633: @Amareshwari in your example: hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1/emptyFile in partToPartitionInfo: [xxx..., xxx..., xxx..., ... hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/1, hdfs://xxx/.../hive_2010-09-07_12-15-00_299_4877141498303008976/-mr-10002/2] If I put these into TestHiveFileFormatUtils, it returns the correct value. Maybe there is some mismatch in the 'xxx' part? CombineHiveInputFormat fails with cannot find dir for emptyFile - Key: HIVE-1633 URL: https://issues.apache.org/jira/browse/HIVE-1633 Project: Hadoop Hive Issue Type: Bug Components: Clients Reporter: Amareshwari Sriramadasu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1226) support filter pushdown against non-native tables
[ https://issues.apache.org/jira/browse/HIVE-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909878#action_12909878 ] He Yongqiang commented on HIVE-1226: The patch looks good. One question: in HBaseStorageHandler, it will exit if searchConditions.size() != 1. This makes sense if there are two point predicates on the key column (connected by 'AND'). What if they are composed to perform a range query (like a < key < b)? Can you also open another jira for indexing to leverage this change? Indexing needs this change to do automatic rewriting of the user's query. support filter pushdown against non-native tables - Key: HIVE-1226 URL: https://issues.apache.org/jira/browse/HIVE-1226 Project: Hadoop Hive Issue Type: Improvement Components: HBase Handler, Query Processor Affects Versions: 0.6.0 Reporter: John Sichi Assignee: John Sichi Fix For: 0.7.0 Attachments: HIVE-1226.1.patch, HIVE-1226.2.patch, HIVE-1226.3.patch, HIVE-1226.4.patch For example, HBase's scan object can take filters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1645) ability to specify parent directory for zookeeper lock manager
[ https://issues.apache.org/jira/browse/HIVE-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909992#action_12909992 ] He Yongqiang commented on HIVE-1645: +1, running tests. ability to specify parent directory for zookeeper lock manager -- Key: HIVE-1645 URL: https://issues.apache.org/jira/browse/HIVE-1645 Project: Hadoop Hive Issue Type: Improvement Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.1645.1.patch For concurrency support, it would be desirable if all the locks were created under a common parent, so that zookeeper can be used for different purposes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1633) CombineHiveInputFormat fails with cannot find dir for emptyFile
[ https://issues.apache.org/jira/browse/HIVE-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908716#action_12908716 ] He Yongqiang commented on HIVE-1633: Amareshwari, can you give more details about your example? From it, I cannot reproduce the problem. CombineHiveInputFormat fails with cannot find dir for emptyFile - Key: HIVE-1633 URL: https://issues.apache.org/jira/browse/HIVE-1633 Project: Hadoop Hive Issue Type: Bug Components: Clients Reporter: Amareshwari Sriramadasu -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1624) Patch to allows scripts in S3 location
[ https://issues.apache.org/jira/browse/HIVE-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908729#action_12908729 ] He Yongqiang commented on HIVE-1624: S3 -> client -> cluster may be better than directly downloading the script from S3 to each TaskTracker node: there may be thousands of concurrent requests to S3 to download a script. (I agree that the script can be cached on the local machine, but right now hive does not do any cache cleanup.) S3 -> client -> cluster will also be able to use the hadoop distributed cache. Patch to allows scripts in S3 location -- Key: HIVE-1624 URL: https://issues.apache.org/jira/browse/HIVE-1624 Project: Hadoop Hive Issue Type: New Feature Reporter: Vaibhav Aggarwal Attachments: HIVE-1624.patch I want to submit a patch which allows users to run scripts located in S3. This patch enables Hive to download the hive scripts located in S3 buckets and execute them. This saves users the effort of copying scripts to HDFS before executing them. Thanks Vaibhav -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException
[ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906837#action_12906837 ] He Yongqiang commented on HIVE-1610: Sammy, we cannot fix this issue by just removing the schema check. If the input URI's path part is the same as one partition's path but their schemas are different, we should still return NULL. For your case, the main problem is the port, which is contained in the partitionDesc but not in the input path. Is it possible to just ignore the port? I mean, is there a case where two different instances share the same address but use different ports? Using CombinedHiveInputFormat causes partToPartitionInfo IOException -- Key: HIVE-1610 URL: https://issues.apache.org/jira/browse/HIVE-1610 Project: Hadoop Hive Issue Type: Bug Environment: Hadoop 0.20.2 Reporter: Sammy Yu Attachments: 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch, 0003-HIVE-1610.patch I have a relatively complicated hive query using CombinedHiveInputFormat: set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.dynamic.partition=true; set hive.exec.max.dynamic.partitions=1000; set hive.exec.max.dynamic.partitions.pernode=300; set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, keywords.universal_rank, keywords.serp_type, keywords.date_indexed, keywords.search_engine_type, keywords.week from keyword_serp_results keywords JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, keywords1.search_engine_type, keywords1.week, keywords1.rank, dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN (select domain, keyword, search_engine_type, week, max(date_indexed) as max_date_indexed from keyword_serp_results group by 
domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = dupkeywords1.keyword AND keywords1.domain = dupkeywords1.domain AND keywords1.search_engine_type = dupkeywords1.search_engine_type AND keywords1.week = dupkeywords1.week AND keywords1.date_indexed = dupkeywords1.max_date_indexed) dupkeywords2 group by domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on keywords.keyword = dupkeywords3.keyword AND keywords.domain = dupkeywords3.domain AND keywords.search_engine_type = dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND keywords.date_indexed = dupkeywords3.max_date_indexed AND keywords.rank = dupkeywords3.best_rank; This query use to work fine until I updated to r991183 on trunk and started getting this error: java.io.IOException: cannot find dir = hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0 in partToPartitionInfo: [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831] at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277) at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.init(CombineHiveInputFormat.java:100) at 
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610) at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108) This query works if I don't change the hive.input.format. set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; I've narrowed down this issue to the commit for HIVE-1510. If I take out the changeset from r987746, everything
[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException
[ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12907058#action_12907058 ] He Yongqiang commented on HIVE-1610: Sammy, there are mainly 2 problems: 1) going over the map is not efficient, and 2) using startsWith to do a prefix match is a bug that was fixed in HIVE-1510. Sammy, can you change the logic as follows: right now, hive generates another pathToPartitionInfo map by removing each path's schema information and puts it in a cache map. We can keep the same logic but change the new pathToPartitionInfo map's values to arrays of PartitionDesc. Then we can just remove the schema check, and once we get a match, we go through the array of PartitionDesc to find the best one. This also solves another problem: if there are 2 partitionDescs whose path parts are the same but whose schemas are different, only one of them is contained in the new pathToPartitionInfo map. About how to go through the array of PartitionDesc to find the best one: if the array contains only 1 element, return array.get(0). Otherwise: 1) if the original input does not have any schema information and the array contains more than 1 element, report an error; 2) if the original input contains schema information: 2.1) if the array contains an element that is an exact match (same schema and port as the input), return it; 2.2) otherwise, ignore the port part but keep the schema and address, and go through the array. What do you think? 
Using CombinedHiveInputFormat causes partToPartitionInfo IOException -- Key: HIVE-1610 URL: https://issues.apache.org/jira/browse/HIVE-1610 Project: Hadoop Hive Issue Type: Bug Environment: Hadoop 0.20.2 Reporter: Sammy Yu Attachments: 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch, 0003-HIVE-1610.patch, 0004-hive.patch I have a relatively complicated hive query using CombinedHiveInputFormat: set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.dynamic.partition=true; set hive.exec.max.dynamic.partitions=1000; set hive.exec.max.dynamic.partitions.pernode=300; set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, keywords.universal_rank, keywords.serp_type, keywords.date_indexed, keywords.search_engine_type, keywords.week from keyword_serp_results keywords JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, keywords1.search_engine_type, keywords1.week, keywords1.rank, dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN (select domain, keyword, search_engine_type, week, max(date_indexed) as max_date_indexed from keyword_serp_results group by domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = dupkeywords1.keyword AND keywords1.domain = dupkeywords1.domain AND keywords1.search_engine_type = dupkeywords1.search_engine_type AND keywords1.week = dupkeywords1.week AND keywords1.date_indexed = dupkeywords1.max_date_indexed) dupkeywords2 group by domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on keywords.keyword = dupkeywords3.keyword AND keywords.domain = dupkeywords3.domain AND keywords.search_engine_type = dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND keywords.date_indexed = 
dupkeywords3.max_date_indexed AND keywords.rank = dupkeywords3.best_rank; This query use to work fine until I updated to r991183 on trunk and started getting this error: java.io.IOException: cannot find dir = hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0 in partToPartitionInfo: [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831] at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277) at
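The best-match selection rules proposed in the comment above can be sketched as follows. This is an illustrative draft only, not the patch that was committed: the class name `BestMatch` is hypothetical, and plain URI strings stand in for PartitionDesc paths whose scheme-less path parts already matched the input.

```java
import java.net.URI;
import java.util.List;

// Sketch of picking the best PartitionDesc among several whose path parts
// match the input: exact match first, then a scheme+host match ignoring
// the port; ambiguous or unmatched inputs raise an error.
public class BestMatch {

    /** Picks the best candidate URI per the rules in the comment, or throws. */
    public static String pick(String input, List<String> candidates) {
        if (candidates.size() == 1) {
            return candidates.get(0);
        }
        URI in = URI.create(input);
        if (in.getScheme() == null) {
            // input has no schema information: more than one candidate is ambiguous
            throw new IllegalStateException("ambiguous match for " + input);
        }
        // 2.1) an exact match (schema, host and port all equal) wins
        for (String c : candidates) {
            if (c.equals(input)) {
                return c;
            }
        }
        // 2.2) otherwise ignore the port but keep schema and address
        for (String c : candidates) {
            URI u = URI.create(c);
            if (in.getScheme().equals(u.getScheme())
                    && in.getHost() != null && in.getHost().equals(u.getHost())) {
                return c;
            }
        }
        throw new IllegalStateException("no match for " + input);
    }
}
```

This would let an input like `hdfs://host/tmp/a` match a partitionDesc registered as `hdfs://host:8020/tmp/a` (the HIVE-1610 case) while still rejecting a candidate whose scheme or host genuinely differs.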
[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException
[ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905751#action_12905751 ] He Yongqiang commented on HIVE-1610: Sammy, the only change in TestHiveFileFormatUtils is to remove URI scheme checks (1 line change). You actually added some lines of code which were removed by HIVE-1510, and this is the reason the testcase fails. Using CombinedHiveInputFormat causes partToPartitionInfo IOException -- Key: HIVE-1610 URL: https://issues.apache.org/jira/browse/HIVE-1610 Project: Hadoop Hive Issue Type: Bug Environment: Hadoop 0.20.2 Reporter: Sammy Yu Attachments: 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch, 0003-HIVE-1610.patch I have a relatively complicated hive query using CombinedHiveInputFormat: set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.dynamic.partition=true; set hive.exec.max.dynamic.partitions=1000; set hive.exec.max.dynamic.partitions.pernode=300; set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week) select distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, keywords.universal_rank, keywords.serp_type, keywords.date_indexed, keywords.search_engine_type, keywords.week from keyword_serp_results keywords JOIN (select domain, keyword, search_engine_type, week, max_date_indexed, min(rank) as best_rank from (select keywords1.domain, keywords1.keyword, keywords1.search_engine_type, keywords1.week, keywords1.rank, dupkeywords1.max_date_indexed from keyword_serp_results keywords1 JOIN (select domain, keyword, search_engine_type, week, max(date_indexed) as max_date_indexed from keyword_serp_results group by domain,keyword,search_engine_type,week) dupkeywords1 on keywords1.keyword = dupkeywords1.keyword AND keywords1.domain = dupkeywords1.domain AND keywords1.search_engine_type = dupkeywords1.search_engine_type AND keywords1.week = 
dupkeywords1.week AND keywords1.date_indexed = dupkeywords1.max_date_indexed) dupkeywords2 group by domain,keyword,search_engine_type,week,max_date_indexed ) dupkeywords3 on keywords.keyword = dupkeywords3.keyword AND keywords.domain = dupkeywords3.domain AND keywords.search_engine_type = dupkeywords3.search_engine_type AND keywords.week = dupkeywords3.week AND keywords.date_indexed = dupkeywords3.max_date_indexed AND keywords.rank = dupkeywords3.best_rank; This query used to work fine until I updated to r991183 on trunk and started getting this error: java.io.IOException: cannot find dir = hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/00_0 in partToPartitionInfo: [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829, hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831] at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277) at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.init(CombineHiveInputFormat.java:100) at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at 
org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610) at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108) This query works if I don't change the hive.input.format. set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; I've narrowed down this issue to the commit for HIVE-1510. If I take out the changeset from r987746, everything works as before. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
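The IOException above shows the same directory under two spellings: the lookup key has no port (hdfs://ec2-...amazonaws.com/tmp/...) while the first map key carries :8020. A minimal sketch of an authority-insensitive comparison follows; the class and method names are made up for illustration and are not Hive's actual code:

```java
import java.net.URI;

// Hypothetical helper, not Hive's code: two HDFS locations that differ only
// in scheme/authority (e.g. a missing ":8020" port) refer to the same
// directory, so compare them by their path component alone.
public class PartitionPathLookup {
    public static String pathOnly(String location) {
        // URI.getPath() drops "hdfs://host:8020" and keeps "/tmp/..."
        return URI.create(location).getPath();
    }

    public static boolean sameDir(String a, String b) {
        return pathOnly(a).equals(pathOnly(b));
    }
}
```

Under this rule the un-ported key in the error message would still resolve to the ported entry in partToPartitionInfo.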
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901503#action_12901503 ] He Yongqiang commented on HIVE-741: --- +1. The patch looks good to me. (Only one minor comment on the name of hasNullElements: should we rename it, since this function is used to determine whether all keys are null?) NULL is not handled correctly in join - Key: HIVE-741 URL: https://issues.apache.org/jira/browse/HIVE-741 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Amareshwari Sriramadasu Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt With the following data in table input4_cb (Key, Value): (NULL, 325), (18, NULL). The following query: {code} select * from input4_cb a join input4_cb b on a.key = b.value; {code} returns the following result: NULL 325 18 NULL The correct result should be the empty set. When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
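The expected empty result follows from SQL three-valued logic: NULL = NULL evaluates to unknown, so a NULL key must never match anything in an inner join, not even another NULL. A toy hash join illustrating that rule (illustrative only, not Hive's join operator):

```java
import java.util.*;

// Illustrative only, not Hive's implementation: an inner hash join that
// follows SQL semantics, where rows with NULL join keys produce no matches.
public class NullSafeJoin {
    public static List<String> innerJoin(List<String> left, List<String> right) {
        Map<String, List<String>> index = new HashMap<>();
        for (String k : right) {
            if (k == null) continue;          // NULL keys never enter the index
            index.computeIfAbsent(k, x -> new ArrayList<>()).add(k);
        }
        List<String> out = new ArrayList<>();
        for (String k : left) {
            if (k == null) continue;          // NULL keys never probe
            for (String m : index.getOrDefault(k, Collections.emptyList())) {
                out.add(k + "=" + m);
            }
        }
        return out;
    }
}
```

With the bug's data (left keys {NULL, 18}, right values {325, NULL}) this correctly yields no rows.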
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901508#action_12901508 ] He Yongqiang commented on HIVE-741: --- also about Ning's comments: 2) SMBMapJoinOperator.compareKey() is called for each row, so it is critical for performance. In your code the hasNullElement() could be called 4 times in the worst case. If you cache the result it can be called only twice. Agree. Not sure how much overhead there is; will try to estimate it on production workloads. It would be great if you can cache the null-check results, so that the check happens only once for each key. NULL is not handled correctly in join - Key: HIVE-741 URL: https://issues.apache.org/jira/browse/HIVE-741 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Amareshwari Sriramadasu Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt With the following data in table input4_cb (Key, Value): (NULL, 325), (18, NULL). The following query: {code} select * from input4_cb a join input4_cb b on a.key = b.value; {code} returns the following result: NULL 325 18 NULL The correct result should be the empty set. When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
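The caching suggestion above can be sketched as memoizing the null check on the key object, so repeated comparisons of the same key do not rescan its columns. Field and method names here are made up, not Hive's:

```java
import java.util.List;

// Sketch of the review comment's caching idea (names are illustrative):
// compute the "all key columns are NULL" check once per key object and
// reuse the cached result on subsequent calls with the same key.
public class CachedNullCheck {
    private Object lastKey;        // key object the cached result belongs to
    private boolean lastResult;

    public boolean allNulls(List<?> key) {
        if (key == lastKey) {
            return lastResult;     // cache hit: same key object as last call
        }
        boolean all = true;
        for (Object col : key) {
            if (col != null) { all = false; break; }
        }
        lastKey = key;
        lastResult = all;
        return all;
    }
}
```

In compareKey() this would reduce the worst case from four scans per row to one per distinct key.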
[jira] Updated: (HIVE-1584) wrong log files in contrib client positive
[ https://issues.apache.org/jira/browse/HIVE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1584: --- Status: Resolved (was: Patch Available) Resolution: Fixed I just committed! Thanks Namit! wrong log files in contrib client positive -- Key: HIVE-1584 URL: https://issues.apache.org/jira/browse/HIVE-1584 Project: Hadoop Hive Issue Type: Bug Components: Testing Infrastructure Reporter: Namit Jain Assignee: Namit Jain Fix For: 0.7.0 Attachments: hive.1584.1.patch TestContribCliDriver still gets some diffs -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1452) Mapside join on non partitioned table with partitioned table causes error
[ https://issues.apache.org/jira/browse/HIVE-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1452: -- Assignee: Thiruvel Thirumoolan Great! Assigned to Thiruvel. I think he is already in the contributor list, and he can just assign jiras to himself now. Mapside join on non partitioned table with partitioned table causes error - Key: HIVE-1452 URL: https://issues.apache.org/jira/browse/HIVE-1452 Project: Hadoop Hive Issue Type: Bug Components: CLI Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: Thiruvel Thirumoolan I am running a script which contains two tables; one is dynamically partitioned and stored in RCFile format, and the other is stored as a TXT file. The TXT file is around 397 MB in size and has around 24 million rows. {code} drop table joinquery; create external table joinquery ( id string, type string, sec string, num string, url string, cost string, listinfo array<map<string,string>> ) STORED AS TEXTFILE LOCATION '/projects/joinquery'; CREATE EXTERNAL TABLE idtable20mil( id string ) STORED AS TEXTFILE LOCATION '/projects/idtable20mil'; insert overwrite table joinquery select /*+ MAPJOIN(idtable20mil) */ rctable.id, rctable.type, rctable.map['sec'], rctable.map['num'], rctable.map['url'], rctable.map['cost'], rctable.listinfo from rctable JOIN idtable20mil on (rctable.id = idtable20mil.id) where rctable.id is not null and rctable.part='value' and rctable.subpart='value' and rctable.pty='100' and rctable.uniqid='1000' order by id; {code} Result: Possible error: Data file <split:string,part:string,subpart:string,subsubpart:string> is corrupted. Solution: Replace file, i.e. by re-running the query that produced the source table / partition. - If I look at the mapper logs: 
{verbatim} Caused by: java.io.IOException: java.io.EOFException at org.apache.hadoop.hive.ql.exec.persistence.MapJoinObjectValue.readExternal(MapJoinObjectValue.java:109) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1792) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1751) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at org.apache.hadoop.hive.ql.util.jdbm.htree.HashBucket.readExternal(HashBucket.java:284) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1792) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1751) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351) at org.apache.hadoop.hive.ql.util.jdbm.helper.Serialization.deserialize(Serialization.java:106) at org.apache.hadoop.hive.ql.util.jdbm.helper.DefaultSerializer.deserialize(DefaultSerializer.java:106) at org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.fetch(BaseRecordManager.java:360) at org.apache.hadoop.hive.ql.util.jdbm.recman.BaseRecordManager.fetch(BaseRecordManager.java:332) at org.apache.hadoop.hive.ql.util.jdbm.htree.HashDirectory.get(HashDirectory.java:195) at org.apache.hadoop.hive.ql.util.jdbm.htree.HTree.get(HTree.java:155) at org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper.get(HashMapWrapper.java:114) ... 11 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2776) at java.io.ObjectInputStream.readInt(ObjectInputStream.java:950) at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:153) at org.apache.hadoop.hive.ql.exec.persistence.MapJoinObjectValue.readExternal(MapJoinObjectValue.java:98) {verbatim} I am trying to create a testcase, which can demonstrate this error. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1581) CompactIndexInputFormat should create split only for files in the index output file.
CompactIndexInputFormat should create split only for files in the index output file. Key: HIVE-1581 URL: https://issues.apache.org/jira/browse/HIVE-1581 Project: Hadoop Hive Issue Type: Improvement Reporter: He Yongqiang Assignee: He Yongqiang Attachments: HIVE-1581.1.patch We can get a list of files from the index file, so no need to create splits based on all files in the base table/partition -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
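The idea in the description, creating splits only for the files the index output names, can be sketched as a simple set-intersection filter. This is illustrative, not the attached patch:

```java
import java.util.*;

// Illustrative sketch (not HIVE-1581.1.patch itself): instead of creating
// splits for every file under the base table/partition, keep only the files
// that the compact index output actually references.
public class IndexSplitFilter {
    public static List<String> filterToIndexed(List<String> allFiles,
                                               Set<String> indexedFiles) {
        List<String> keep = new ArrayList<>();
        for (String f : allFiles) {
            if (indexedFiles.contains(f)) {
                keep.add(f);   // only indexed files get splits
            }
        }
        return keep;
    }
}
```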
[jira] Updated: (HIVE-1581) CompactIndexInputFormat should create split only for files in the index output file.
[ https://issues.apache.org/jira/browse/HIVE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1581: --- Attachment: HIVE-1581.1.patch CompactIndexInputFormat should create split only for files in the index output file. Key: HIVE-1581 URL: https://issues.apache.org/jira/browse/HIVE-1581 Project: Hadoop Hive Issue Type: Improvement Reporter: He Yongqiang Assignee: He Yongqiang Attachments: HIVE-1581.1.patch We can get a list of files from the index file, so no need to create splits based on all files in the base table/partition -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1581) CompactIndexInputFormat should create split only for files in the index output file.
[ https://issues.apache.org/jira/browse/HIVE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1581: --- Attachment: (was: HIVE-1581.1.patch) CompactIndexInputFormat should create split only for files in the index output file. Key: HIVE-1581 URL: https://issues.apache.org/jira/browse/HIVE-1581 Project: Hadoop Hive Issue Type: Improvement Reporter: He Yongqiang Assignee: He Yongqiang We can get a list of files from the index file, so no need to create splits based on all files in the base table/partition -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1582) merge mapfiles task behaves incorrectly for 'inserting overwrite directory...'
merge mapfiles task behaves incorrectly for 'inserting overwrite directory...' -- Key: HIVE-1582 URL: https://issues.apache.org/jira/browse/HIVE-1582 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang hive> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; hive> SET hive.exec.compress.output=false; hive> INSERT OVERWRITE DIRECTORY 'x' SELECT * from a; Total MapReduce jobs = 2 Launching Job 1 out of 2 Number of reduce tasks is set to 0 since there's no reduce operator .. Ended Job = job_201008191557_54169 Ended Job = 450290112, job is filtered out (removed at runtime). Launching Job 2 out of 2 . the second job should not get started. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1581) CompactIndexInputFormat should create split only for files in the index output file.
[ https://issues.apache.org/jira/browse/HIVE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1581: --- Attachment: HIVE-1581.1.patch CompactIndexInputFormat should create split only for files in the index output file. Key: HIVE-1581 URL: https://issues.apache.org/jira/browse/HIVE-1581 Project: Hadoop Hive Issue Type: Improvement Reporter: He Yongqiang Assignee: He Yongqiang Attachments: HIVE-1581.1.patch We can get a list of files from the index file, so no need to create splits based on all files in the base table/partition -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1581) CompactIndexInputFormat should create split only for files in the index output file.
[ https://issues.apache.org/jira/browse/HIVE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1581: --- Status: Patch Available (was: Open) CompactIndexInputFormat should create split only for files in the index output file. Key: HIVE-1581 URL: https://issues.apache.org/jira/browse/HIVE-1581 Project: Hadoop Hive Issue Type: Improvement Reporter: He Yongqiang Assignee: He Yongqiang Attachments: HIVE-1581.1.patch We can get a list of files from the index file, so no need to create splits based on all files in the base table/partition -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1582) merge mapfiles task behaves incorrectly for 'inserting overwrite directory...'
[ https://issues.apache.org/jira/browse/HIVE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901242#action_12901242 ] He Yongqiang commented on HIVE-1582: Ended Job = 450290112, job is filtered out (removed at runtime). The second job seems to be filtered out at runtime. merge mapfiles task behaves incorrectly for 'inserting overwrite directory...' -- Key: HIVE-1582 URL: https://issues.apache.org/jira/browse/HIVE-1582 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang hive> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; hive> SET hive.exec.compress.output=false; hive> INSERT OVERWRITE DIRECTORY 'x' SELECT * from a; Total MapReduce jobs = 2 Launching Job 1 out of 2 Number of reduce tasks is set to 0 since there's no reduce operator .. Ended Job = job_201008191557_54169 Ended Job = 450290112, job is filtered out (removed at runtime). Launching Job 2 out of 2 . the second job should not get started. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
[ https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900904#action_12900904 ] He Yongqiang commented on HIVE-1510: even without this patch, the 0.17 test failed on index_compat3.q. Please file a separate jira for this issue. HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path Key: HIVE-1510 URL: https://issues.apache.org/jira/browse/HIVE-1510 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1510.1.patch, hive-1510.3.patch, hive-1510.4.patch set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; drop table combine_3_srcpart_seq_rc; create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=00) select * from src; alter table combine_3_srcpart_seq_rc set fileformat rcfile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=001) select * from src; desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=00); desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=001); select * from combine_3_srcpart_seq_rc where ds=2010-08-03 order by key; drop table combine_3_srcpart_seq_rc; will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
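The failure in this script comes from the two partition directories hr=00 and hr=001: the first is a string prefix of the second, so a prefix-based partitionDesc lookup can hand hr=001 files the hr=00 (sequencefile) descriptor. A toy comparison of the two matching rules (illustrative, not Hive's code):

```java
// Why prefix matching is unsafe here (illustrative, not Hive's code): the
// partition directory ".../hr=00" is a string prefix of ".../hr=001", so a
// raw prefix lookup can return the wrong partition's descriptor. Requiring
// a full path-component boundary avoids that.
public class PartitionMatch {
    public static boolean prefixMatch(String dir, String file) {
        return file.startsWith(dir);           // unsafe: hr=00 matches hr=001/...
    }

    public static boolean componentMatch(String dir, String file) {
        return file.startsWith(dir + "/");     // require a component boundary
    }
}
```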
[jira] Updated: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
[ https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1510: --- Attachment: hive-1510.4.patch HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path Key: HIVE-1510 URL: https://issues.apache.org/jira/browse/HIVE-1510 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1510.1.patch, hive-1510.3.patch, hive-1510.4.patch set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; drop table combine_3_srcpart_seq_rc; create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=00) select * from src; alter table combine_3_srcpart_seq_rc set fileformat rcfile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=001) select * from src; desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=00); desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=001); select * from combine_3_srcpart_seq_rc where ds=2010-08-03 order by key; drop table combine_3_srcpart_seq_rc; will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1567) increase hive.mapjoin.maxsize to 10 million
increase hive.mapjoin.maxsize to 10 million --- Key: HIVE-1567 URL: https://issues.apache.org/jira/browse/HIVE-1567 Project: Hadoop Hive Issue Type: Improvement Reporter: He Yongqiang I saw that on a very wide table, Hive can process 1 million rows in less than one minute (selecting all columns). Setting hive.mapjoin.maxsize to 100k is too restrictive. Let's increase it to 10 million. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1561) smb_mapjoin_8.q returns different results in miniMr mode
[ https://issues.apache.org/jira/browse/HIVE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900384#action_12900384 ] He Yongqiang commented on HIVE-1561: Amareshwari, did you use BucketizedHiveInputFormat for your query? SMBJoin can only work with BucketizedHiveInputFormat. smb_mapjoin_8.q returns different results in miniMr mode Key: HIVE-1561 URL: https://issues.apache.org/jira/browse/HIVE-1561 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Joydeep Sen Sarma Assignee: He Yongqiang follow on to HIVE-1523: ant -Dclustermode=miniMR -Dtestcase=TestCliDriver -Dqfile=smb_mapjoin_8.q test POSTHOOK: query: select /*+mapjoin(a)*/ * from smb_bucket4_1 a full outer join smb_bucket4_2 b on a.key = b.key official results: 4 val_356 NULL NULL NULL NULL 484 val_169 2000 val_169 NULL NULL NULL NULL 3000 val_169 4000 val_125 NULL NULL in minimr mode: 2000 val_169 NULL NULL 4 val_356 NULL NULL 2000 val_169 NULL NULL 4000 val_125 NULL NULL NULL NULL 5000 val_125 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1564) bucketizedhiveinputformat.q fails in minimr mode
[ https://issues.apache.org/jira/browse/HIVE-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1564: -- Assignee: He Yongqiang bucketizedhiveinputformat.q fails in minimr mode Key: HIVE-1564 URL: https://issues.apache.org/jira/browse/HIVE-1564 Project: Hadoop Hive Issue Type: Bug Reporter: Joydeep Sen Sarma Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1564.1.patch followup to HIVE-1523: ant -Dtestcase=TestCliDriver -Dqfile=bucketizedhiveinputformat.q -Dclustermode=miniMR clean-test test [junit] Begin query: bucketizedhiveinputformat.q [junit] Exception: null [junit] java.lang.AssertionError [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:788) [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:624) [junit] at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120) ExecDriver.java:788 // These tasks should have come from the same job. assert(ti.getJobId() == jobId); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
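One generic Java pitfall worth noting around the quoted assertion assert(ti.getJobId() == jobId): if the job ids are Strings, == compares references, so the assert can fire even when the two ids are textually equal. This is only a plausible reading of the symptom, not a claim about what hive-1564.1.patch actually changes:

```java
// Generic Java pitfall, not a claim about HIVE-1564's actual fix: for
// String ids, "==" tests object identity while equals() tests content,
// so equal copies of the same job id fail an identity comparison.
public class JobIdCompare {
    public static boolean sameById(String a, String b) {
        return a == b;                       // identity: false for equal copies
    }

    public static boolean sameByValue(String a, String b) {
        return a != null && a.equals(b);     // content comparison
    }
}
```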
[jira] Updated: (HIVE-1564) bucketizedhiveinputformat.q fails in minimr mode
[ https://issues.apache.org/jira/browse/HIVE-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1564: --- Status: Patch Available (was: Open) Fix Version/s: 0.7.0 bucketizedhiveinputformat.q fails in minimr mode Key: HIVE-1564 URL: https://issues.apache.org/jira/browse/HIVE-1564 Project: Hadoop Hive Issue Type: Bug Reporter: Joydeep Sen Sarma Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1564.1.patch followup to HIVE-1523: ant -Dtestcase=TestCliDriver -Dqfile=bucketizedhiveinputformat.q -Dclustermode=miniMR clean-test test [junit] Begin query: bucketizedhiveinputformat.q [junit] Exception: null [junit] java.lang.AssertionError [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:788) [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:624) [junit] at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120) ExecDriver.java:788 // These tasks should have come from the same job. assert(ti.getJobId() == jobId); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1564) bucketizedhiveinputformat.q fails in minimr mode
[ https://issues.apache.org/jira/browse/HIVE-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1564: --- Attachment: hive-1564.1.patch bucketizedhiveinputformat.q fails in minimr mode Key: HIVE-1564 URL: https://issues.apache.org/jira/browse/HIVE-1564 Project: Hadoop Hive Issue Type: Bug Reporter: Joydeep Sen Sarma Fix For: 0.7.0 Attachments: hive-1564.1.patch followup to HIVE-1523: ant -Dtestcase=TestCliDriver -Dqfile=bucketizedhiveinputformat.q -Dclustermode=miniMR clean-test test [junit] Begin query: bucketizedhiveinputformat.q [junit] Exception: null [junit] java.lang.AssertionError [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.showJobFailDebugInfo(ExecDriver.java:788) [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:624) [junit] at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120) ExecDriver.java:788 // These tasks should have come from the same job. assert(ti.getJobId() == jobId); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1561) smb_mapjoin_8.q returns different results in miniMr mode
[ https://issues.apache.org/jira/browse/HIVE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1561: --- Status: Patch Available (was: Open) smb_mapjoin_8.q returns different results in miniMr mode Key: HIVE-1561 URL: https://issues.apache.org/jira/browse/HIVE-1561 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Joydeep Sen Sarma Assignee: He Yongqiang Attachments: hive-1561.1.patch follow on to HIVE-1523: ant -Dclustermode=miniMR -Dtestcase=TestCliDriver -Dqfile=smb_mapjoin_8.q test POSTHOOK: query: select /*+mapjoin(a)*/ * from smb_bucket4_1 a full outer join smb_bucket4_2 b on a.key = b.key official results: 4 val_356 NULL NULL NULL NULL 484 val_169 2000 val_169 NULL NULL NULL NULL 3000 val_169 4000 val_125 NULL NULL in minimr mode: 2000 val_169 NULL NULL 4 val_356 NULL NULL 2000 val_169 NULL NULL 4000 val_125 NULL NULL NULL NULL 5000 val_125 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1561) smb_mapjoin_8.q returns different results in miniMr mode
[ https://issues.apache.org/jira/browse/HIVE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1561: --- Attachment: hive-1561.1.patch smb_mapjoin_8.q returns different results in miniMr mode Key: HIVE-1561 URL: https://issues.apache.org/jira/browse/HIVE-1561 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Joydeep Sen Sarma Assignee: He Yongqiang Attachments: hive-1561.1.patch follow on to HIVE-1523: ant -Dclustermode=miniMR -Dtestcase=TestCliDriver -Dqfile=smb_mapjoin_8.q test POSTHOOK: query: select /*+mapjoin(a)*/ * from smb_bucket4_1 a full outer join smb_bucket4_2 b on a.key = b.key official results: 4 val_356 NULL NULL NULL NULL 484 val_169 2000 val_169 NULL NULL NULL NULL 3000 val_169 4000 val_125 NULL NULL in minimr mode: 2000 val_169 NULL NULL 4 val_356 NULL NULL 2000 val_169 NULL NULL 4000 val_125 NULL NULL NULL NULL 5000 val_125 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-1569) groupby_bigdata.q fails in minimr mode
[ https://issues.apache.org/jira/browse/HIVE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang resolved HIVE-1569. Resolution: Invalid Local mode and miniMR use different filesystems, so there is no single script path that works for both. groupby_bigdata.q fails in minimr mode -- Key: HIVE-1569 URL: https://issues.apache.org/jira/browse/HIVE-1569 Project: Hadoop Hive Issue Type: Bug Components: Testing Infrastructure Reporter: Namit Jain Assignee: He Yongqiang -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1572) skewjoin.q output in minimr differs from local mode
[ https://issues.apache.org/jira/browse/HIVE-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1572: -- Assignee: He Yongqiang skewjoin.q output in minimr differs from local mode --- Key: HIVE-1572 URL: https://issues.apache.org/jira/browse/HIVE-1572 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Joydeep Sen Sarma Assignee: He Yongqiang checked in results: POSTHOOK: query: SELECT sum(hash(src1.key)), sum(hash(src1.val)), sum(hash(src2.key)) FROM T1 src1 JOIN T2 src2 ON src1.key+1 = src2.key JOIN T2 src3 ON src2.key = src3.key 370 11003 377 in minimr mode: POSTHOOK: query: SELECT sum(hash(src1.key)), sum(hash(src1.val)), sum(hash(src2.key)) FROM T1 src1 JOIN T2 src2 ON src1.key+1 = src2.key JOIN T2 src3 ON src2.key = src3.key 150 4707 153 it seems that the query is deterministic - so filing a bug. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
[ https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1510: --- Attachment: hive-1510.3.patch HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path Key: HIVE-1510 URL: https://issues.apache.org/jira/browse/HIVE-1510 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1510.1.patch, hive-1510.3.patch set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; drop table combine_3_srcpart_seq_rc; create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=00) select * from src; alter table combine_3_srcpart_seq_rc set fileformat rcfile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=001) select * from src; desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=00); desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=001); select * from combine_3_srcpart_seq_rc where ds=2010-08-03 order by key; drop table combine_3_srcpart_seq_rc; will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
[ https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900063#action_12900063 ] He Yongqiang commented on HIVE-1510: The IOPrepareCache is cleared in Driver, which should only contain generic code irrespective of task types. Can you do it in ExecDriver.execute()? This new cache is only used in ExecDriver anyway. ExecDriver is per map-reduce task; Driver is per query. We should do this at query granularity. I think the pathToPartitionDesc is also a per-query map? Some comments on why you need a new hash map keyed by the paths only would be helpful. Will do it in the next patch. HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path Key: HIVE-1510 URL: https://issues.apache.org/jira/browse/HIVE-1510 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1510.1.patch, hive-1510.3.patch set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; drop table combine_3_srcpart_seq_rc; create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=00) select * from src; alter table combine_3_srcpart_seq_rc set fileformat rcfile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=001) select * from src; desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=00); desc extended combine_3_srcpart_seq_rc partition(ds=2010-08-03, hr=001); select * from combine_3_srcpart_seq_rc where ds=2010-08-03 order by key; drop table combine_3_srcpart_seq_rc; will fail. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1561) smb_mapjoin_8.q returns different results in miniMr mode
[ https://issues.apache.org/jira/browse/HIVE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1561: -- Assignee: He Yongqiang smb_mapjoin_8.q returns different results in miniMr mode Key: HIVE-1561 URL: https://issues.apache.org/jira/browse/HIVE-1561 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Joydeep Sen Sarma Assignee: He Yongqiang follow on to HIVE-1523: ant -Dclustermode=miniMR -Dtestcase=TestCliDriver -Dqfile=smb_mapjoin_8.q test POSTHOOK: query: select /*+mapjoin(a)*/ * from smb_bucket4_1 a full outer join smb_bucket4_2 b on a.key = b.key official results: 4 val_356 NULL NULL NULL NULL 484 val_169 2000 val_169 NULL NULL NULL NULL 3000 val_169 4000 val_125 NULL NULL in minimr mode: 2000 val_169 NULL NULL 4 val_356 NULL NULL 2000 val_169 NULL NULL 4000 val_125 NULL NULL NULL NULL 5000 val_125 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
[ https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900074#action_12900074 ] He Yongqiang commented on HIVE-1510: About the additional hashmap: it is used to match a path to its partitionDesc by discarding the path's scheme information. In the long run, we should normalize all input paths so that they contain full scheme and authority information. This is a must for Hive to work with multiple HDFS clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
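The scheme-discarding comparison described above can be sketched with the JDK's URI parser (an illustration of the idea, not the patch's actual code): two paths match if their path components are equal once the scheme and authority are stripped.

```java
import java.net.URI;

public class PathKey {
    // Drop scheme and authority, keeping only the path component,
    // e.g. "hdfs://nn:8020/warehouse/t/hr=00" -> "/warehouse/t/hr=00".
    public static String schemeless(String path) {
        return URI.create(path).getPath();
    }

    public static boolean samePartitionPath(String a, String b) {
        return schemeless(a).equals(schemeless(b));
    }
}
```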
[jira] Commented: (HIVE-1561) smb_mapjoin_8.q returns different results in miniMr mode
[ https://issues.apache.org/jira/browse/HIVE-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900161#action_12900161 ] He Yongqiang commented on HIVE-1561: This is the complete result from Hive's smb_mapjoin_8.q.out, and it is correct:
{noformat}
POSTHOOK: query: select /*+mapjoin(a)*/ * from smb_bucket4_1 a full outer join smb_bucket4_2 b on a.key = b.key
POSTHOOK: type: QUERY
POSTHOOK: Input: defa...@smb_bucket4_2
POSTHOOK: Input: defa...@smb_bucket4_1
POSTHOOK: Output: file:/tmp/jssarma/hive_2010-07-21_12-02-34_137_8141051139723931378/1
POSTHOOK: Lineage: smb_bucket4_1.key SIMPLE [(smb_bucket_input)smb_bucket_input.FieldSchema(name:key, type:int, comment:from deserializer), ]
POSTHOOK: Lineage: smb_bucket4_1.value SIMPLE [(smb_bucket_input)smb_bucket_input.FieldSchema(name:value, type:string, comment:from deserializer), ]
POSTHOOK: Lineage: smb_bucket4_2.key SIMPLE [(smb_bucket_input)smb_bucket_input.FieldSchema(name:key, type:int, comment:from deserializer), ]
POSTHOOK: Lineage: smb_bucket4_2.value SIMPLE [(smb_bucket_input)smb_bucket_input.FieldSchema(name:value, type:string, comment:from deserializer), ]
4       val_356    NULL    NULL
NULL    NULL       484     val_169
2000    val_169    NULL    NULL
NULL    NULL       3000    val_169
4000    val_125    NULL    NULL
NULL    NULL       5000    val_125
{noformat}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1203) HiveInputFormat.getInputFormatFromCache swallows cause exception when trowing IOExcpetion
[ https://issues.apache.org/jira/browse/HIVE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899488#action_12899488 ] He Yongqiang commented on HIVE-1203: Vladimir, can you update the patch? After that, I will test and commit it. HiveInputFormat.getInputFormatFromCache swallows cause exception when trowing IOExcpetion Key: HIVE-1203 URL: https://issues.apache.org/jira/browse/HIVE-1203 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.4.0, 0.4.1, 0.5.0 Reporter: Vladimir Klimontovich Assignee: Vladimir Klimontovich Attachments: 0.4.patch, 0.5.patch, trunk.patch To fix this, we simply need to add a second parameter to the IOException constructor. Patches for 0.4, 0.5, and trunk are available. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
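The fix described in this issue is the standard exception-chaining pattern: pass the original exception as the second argument to the IOException constructor so the root cause is not swallowed. A minimal sketch:

```java
import java.io.IOException;

public class WrapCause {
    // Before the fix: new IOException(msg) -- the cause was lost.
    // After the fix: the two-argument constructor preserves the full chain.
    public static IOException wrap(String msg, Exception cause) {
        return new IOException(msg, cause);
    }
}
```

With the cause attached, the original stack trace (e.g. a ClassNotFoundException for the input format class) shows up in the logs instead of a bare IOException.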
[jira] Updated: (HIVE-1203) HiveInputFormat.getInputFormatFromCache swallows cause exception when trowing IOExcpetion
[ https://issues.apache.org/jira/browse/HIVE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1203: --- Status: Open (was: Patch Available) Affects Version/s: (was: 0.4.0) (was: 0.5.0) (was: 0.4.1) Fix Version/s: 0.7.0 HiveInputFormat.getInputFormatFromCache swallows cause exception when trowing IOExcpetion Key: HIVE-1203 URL: https://issues.apache.org/jira/browse/HIVE-1203 Project: Hadoop Hive Issue Type: Bug Reporter: Vladimir Klimontovich Assignee: Vladimir Klimontovich Fix For: 0.7.0 Attachments: 0.4.patch, 0.5.patch, trunk.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1548) populate inputs and outputs for all statements
[ https://issues.apache.org/jira/browse/HIVE-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899584#action_12899584 ] He Yongqiang commented on HIVE-1548: running test now. populate inputs and outputs for all statements -- Key: HIVE-1548 URL: https://issues.apache.org/jira/browse/HIVE-1548 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Attachments: hive.1548.1.patch Currently, they are only populated for queries - and not for most of the DDLs. The pre and post execution hooks do not get the correct values. It would also be very useful for locking -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1548) populate inputs and outputs for all statements
[ https://issues.apache.org/jira/browse/HIVE-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1548: --- Status: Resolved (was: Patch Available) Fix Version/s: 0.7.0 Resolution: Fixed I just committed! Thanks, Namit! populate inputs and outputs for all statements -- Key: HIVE-1548 URL: https://issues.apache.org/jira/browse/HIVE-1548 Project: Hadoop Hive Issue Type: Bug Components: Query Processor Reporter: Namit Jain Assignee: Namit Jain Fix For: 0.7.0 Attachments: hive.1548.1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898996#action_12898996 ] He Yongqiang commented on HIVE-741: --- The change looks good to me. Can you also add one or a few tests for sort-merge join? NULL is not handled correctly in join - Key: HIVE-741 URL: https://issues.apache.org/jira/browse/HIVE-741 Project: Hadoop Hive Issue Type: Bug Reporter: Ning Zhang Assignee: Amareshwari Sriramadasu Attachments: patch-741.txt With the following data in table input4_cb (Key, Value): (NULL, 325) and (18, NULL), the following query: {code} select * from input4_cb a join input4_cb b on a.key = b.value; {code} returns the row (NULL, 325, 18, NULL). The correct result should be the empty set. When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
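The expected SQL semantics behind this bug: NULL never compares equal to anything, including NULL, so a row with a null join key must produce no match. A minimal sketch of a join that honors this (illustrative only, not Hive's join operator):

```java
import java.util.ArrayList;
import java.util.List;

public class NullSafeJoin {
    // Inner-join a.key against b.value; rows with a null key never match,
    // mirroring SQL's NULL = NULL -> not true.
    public static List<int[]> join(Integer[] aKeys, Integer[] bValues) {
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < aKeys.length; i++) {
            if (aKeys[i] == null) continue;          // skip null join keys
            for (int j = 0; j < bValues.length; j++) {
                if (aKeys[i].equals(bValues[j])) {   // equals is null-safe on bValues[j]
                    matches.add(new int[] {i, j});
                }
            }
        }
        return matches;
    }
}
```

With the input4_cb data above (keys {NULL, 18}, values {325, NULL}), this returns the empty set, which is the result the issue says Hive should produce.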
[jira] Commented: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
[ https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899086#action_12899086 ] He Yongqiang commented on HIVE-1510: Since HIVE-1515 depends on Hadoop, can we close this JIRA without adding new archive testcases? HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path Key: HIVE-1510 URL: https://issues.apache.org/jira/browse/HIVE-1510 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1510.1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1543) set abort in ExecMapper when Hive's record reader got an IOException
[ https://issues.apache.org/jira/browse/HIVE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899098#action_12899098 ] He Yongqiang commented on HIVE-1543: Let's do it in HiveContextAwareRecordReader, and maybe store the variable in IOContext? set abort in ExecMapper when Hive's record reader got an IOException Key: HIVE-1543 URL: https://issues.apache.org/jira/browse/HIVE-1543 Project: Hadoop Hive Issue Type: Improvement Reporter: Ning Zhang Assignee: Ning Zhang Fix For: 0.7.0 Attachments: HIVE-1543.patch When the RecordReader gets an IOException, ExecMapper does not know and will close the operators as if there were no error. We should catch this exception and avoid writing partial results to HDFS, which would be removed later anyway. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
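The idea being discussed can be sketched as a record-reader wrapper that records the failure in a flag the mapper can consult before committing output. All names here are invented stand-ins for HiveContextAwareRecordReader/IOContext, under the assumption that the flag lives with the reader:

```java
import java.io.IOException;

public class AbortingReader {
    // Stand-in for Hive's RecordReader interface.
    public interface Reader { String next() throws IOException; }

    private final Reader inner;
    private boolean abort = false;

    public AbortingReader(Reader inner) { this.inner = inner; }

    public String next() throws IOException {
        try {
            return inner.next();
        } catch (IOException e) {
            abort = true;   // the mapper checks this before closing operators
            throw e;        // still propagate so the task fails
        }
    }

    // Analogous to the abort flag ExecMapper would consult: if set,
    // skip the normal close path that flushes partial results to HDFS.
    public boolean isAborted() { return abort; }
}
```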
[jira] Commented: (HIVE-1543) set abort in ExecMapper when Hive's record reader got an IOException
[ https://issues.apache.org/jira/browse/HIVE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899169#action_12899169 ] He Yongqiang commented on HIVE-1543: We can do two different patches for trunk and 0.6. I think BucketizedHiveRecordReader also extends HiveContextAwareRecordReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1532) Replace globStatus with listStatus inside Hive.java's replaceFiles.
[ https://issues.apache.org/jira/browse/HIVE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1532: --- Attachment: Hive-1532.1.patch Replace globStatus with listStatus inside Hive.java's replaceFiles. --- Key: HIVE-1532 URL: https://issues.apache.org/jira/browse/HIVE-1532 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: Hive-1532.1.patch globStatus expects a regular expression, so if there is special characters (like '{' , '[') in the filepath, this function will fail. We should be able to replace this call with listStatus easily since we are not passing regex to replaceFiles(). The only places replaceFiles is called is in loadPartition and Table's replaceFiles. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1514) Be able to modify a partition's fileformat and file location information.
[ https://issues.apache.org/jira/browse/HIVE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898350#action_12898350 ] He Yongqiang commented on HIVE-1514: I updated the wiki page here: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Alter_Table.2BAC8-Partition_Location This only changes the metadata. With this patch, you can point a partition at an external location and use a new file format, provided the metadata you specify is correct. Be able to modify a partition's fileformat and file location information. - Key: HIVE-1514 URL: https://issues.apache.org/jira/browse/HIVE-1514 Project: Hadoop Hive Issue Type: New Feature Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1514.1.patch, hive-1514.2.patch, hive-1514.3.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1495) supply correct information to hooks and lineage for index rebuild
[ https://issues.apache.org/jira/browse/HIVE-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1495: --- Attachment: hive-1495.5.patch Sorry, forgot to update outputs for these two testcases. Will be more careful next time. supply correct information to hooks and lineage for index rebuild - Key: HIVE-1495 URL: https://issues.apache.org/jira/browse/HIVE-1495 Project: Hadoop Hive Issue Type: Improvement Components: Indexing Affects Versions: 0.7.0 Reporter: John Sichi Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1495.1.patch, hive-1495.2.patch, hive-1495.3.patch, hive-1495.4.patch, hive-1495.5.patch This is a followup for HIVE-417. Ashish can probably help on how this should work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.
[ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1515: --- Attachment: hive-1515.2.patch Attached a possible fix. Talked with Namit and Paul this afternoon about this issue. Actually there is a config which can disable the FileSystem cache: fs.%s.impl.disable.cache, where %s is the filesystem scheme; for archive it's har. So if you set fs.har.impl.disable.cache to true, archives will automatically work. This should be the clean way to fix this issue. In order to do this, you need to apply https://issues.apache.org/jira/browse/HADOOP-6231 if your Hadoop does not include the code to disable the FileSystem cache. archive is not working when multiple partitions inside one table are archived. -- Key: HIVE-1515 URL: https://issues.apache.org/jira/browse/HIVE-1515 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1515.1.patch, hive-1515.2.patch set hive.exec.compress.output = true; set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; set mapred.min.split.size=256; set mapred.min.split.size.per.node=256; set mapred.min.split.size.per.rack=256; set mapred.max.split.size=256; set hive.archive.enabled = true; drop table combine_3_srcpart_seq_rc; create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=00) select * from src; insert overwrite table combine_3_srcpart_seq_rc partition (ds=2010-08-03, hr=001) select * from src; ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds=2010-08-03, hr=00); ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds=2010-08-03, hr=001); select key, value, ds, hr from combine_3_srcpart_seq_rc where ds=2010-08-03 order by key, hr limit 30; drop table combine_3_srcpart_seq_rc; will fail. 
java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har It fails because there are two input paths (one for each partition) for the above query: 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 But calling path.getFileSystem() on these two input paths returns the same FileSystem instance, which points at the first caller's archive, in this case har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har The reason is that Hadoop's FileSystem has a global cache, and when loading a FileSystem instance for a given path, it only takes the path's scheme and authority (plus the user) to look up the cache. So when we call Path.getFileSystem for the second har path, it actually returns the FileSystem handle for the first path. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
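The cache collision described above can be sketched with plain URI parsing: if the cache key is built only from the URI's scheme and authority, every `har:/...` path (which has no authority component) produces the same key, so the second archive resolves to the first archive's cached FileSystem. A simplified illustration of the keying, not Hadoop's actual cache code:

```java
import java.net.URI;
import java.util.Objects;

public class FsCacheKey {
    // Simplified analogue of the FileSystem cache key: scheme + authority only.
    // The path component -- which differs between the two har archives -- is ignored.
    public static String key(String path) {
        URI u = URI.create(path);
        return u.getScheme() + "://" + Objects.toString(u.getAuthority(), "");
    }
}
```

Both har paths from the bug report map to the key `har://`, which is why one cached instance serves both archives unless caching is disabled for the har scheme.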
[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.
[ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1515: --- Assignee: (was: He Yongqiang) archive is not working when multiple partitions inside one table are archived. -- Key: HIVE-1515 URL: https://issues.apache.org/jira/browse/HIVE-1515 Project: Hadoop Hive Issue Type: Bug Affects Versions: 0.7.0 Reporter: He Yongqiang Attachments: hive-1515.1.patch, hive-1515.2.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-1535) alter partition should throw exception if the specified partition does not exist.
alter partition should throw exception if the specified partition does not exist. - Key: HIVE-1535 URL: https://issues.apache.org/jira/browse/HIVE-1535 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-1532) Replace globStatus with listStatus inside Hive.java's replaceFiles.
[ https://issues.apache.org/jira/browse/HIVE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1532: -- Assignee: He Yongqiang Replace globStatus with listStatus inside Hive.java's replaceFiles. --- Key: HIVE-1532 URL: https://issues.apache.org/jira/browse/HIVE-1532 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
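The failure mode behind this issue can be illustrated with the JDK's glob support rather than Hadoop's globStatus: characters like '{' and '[' are glob metacharacters, so a literal file path containing them is an invalid pattern, while a plain directory listing (the listStatus approach) never interprets the name. A sketch under that assumption:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;

public class GlobVsList {
    // Returns false when the path, treated as a glob pattern, fails to
    // parse -- e.g. an unclosed '{' group or '[' class in a file name.
    public static boolean isValidGlob(String pattern) {
        try {
            PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
            return m != null;
        } catch (java.util.regex.PatternSyntaxException e) {
            return false;
        }
    }
}
```

Switching replaceFiles to a listing-based API sidesteps this entirely, since no glob parsing happens on literal paths.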
[jira] Assigned: (HIVE-1522) replace columns should prohibit using partition column names.
[ https://issues.apache.org/jira/browse/HIVE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang reassigned HIVE-1522: -- Assignee: He Yongqiang replace columns should prohibit using partition column names. - Key: HIVE-1522 URL: https://issues.apache.org/jira/browse/HIVE-1522 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang create table src_part_w(key int , value string) partitioned by (ds string, hr int); alter table src_part_w replace columns (key int, ds string, hr int, value string); should not be allowed. Once the alter table replace columns ... is done, all commands on this table will fail, and it is not possible to change the schema back. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1535) alter partition should throw exception if the specified partition does not exist.
[ https://issues.apache.org/jira/browse/HIVE-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1535: --- Attachment: hive-1535.1.patch No negative tests are included because Hive's tests run against a local metastore, which already throws an exception if the partition does not exist, so the problem does not show up there. alter partition should throw exception if the specified partition does not exist. - Key: HIVE-1535 URL: https://issues.apache.org/jira/browse/HIVE-1535 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1535.1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1535) alter partition should throw exception if the specified partition does not exist.
[ https://issues.apache.org/jira/browse/HIVE-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1535: --- Status: Patch Available (was: Open) alter partition should throw exception if the specified partition does not exist. - Key: HIVE-1535 URL: https://issues.apache.org/jira/browse/HIVE-1535 Project: Hadoop Hive Issue Type: Bug Reporter: He Yongqiang Assignee: He Yongqiang Attachments: hive-1535.1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1495) supply correct information to hooks and lineage for index rebuild
[ https://issues.apache.org/jira/browse/HIVE-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] He Yongqiang updated HIVE-1495: --- Attachment: hive-1495.3.patch supply correct information to hooks and lineage for index rebuild - Key: HIVE-1495 URL: https://issues.apache.org/jira/browse/HIVE-1495 Project: Hadoop Hive Issue Type: Improvement Components: Indexing Affects Versions: 0.7.0 Reporter: John Sichi Assignee: He Yongqiang Fix For: 0.7.0 Attachments: hive-1495.1.patch, hive-1495.2.patch, hive-1495.3.patch This is a followup for HIVE-417. Ashish can probably help on how this should work. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.