[jira] Created: (HIVE-1206) Return row results in a list instead of a tab-delimited string
Return row results in a list instead of a tab-delimited string
--------------------------------------------------------------

                 Key: HIVE-1206
                 URL: https://issues.apache.org/jira/browse/HIVE-1206
             Project: Hadoop Hive
          Issue Type: Bug
            Reporter: bc Wong

Driver.getResults() returns each row as a single string, with the fields always tab-delimited. This breaks for data that itself contains tabs. It would be really nice if the interface allowed returning each row as a list of fields.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
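To illustrate the problem reported above, here is a minimal Python sketch (not Hive code; the helper names `encode_row_tab_delimited` and `decode_row_tab_delimited` are hypothetical). It assumes a row is produced by joining fields with tabs and shows that a field containing a tab cannot be recovered afterwards, whereas a list of fields keeps its shape.

```python
def encode_row_tab_delimited(fields):
    """Join fields with tabs, mimicking a tab-delimited row string."""
    return "\t".join(fields)


def decode_row_tab_delimited(row):
    """Split a tab-delimited row back into fields (lossy if data has tabs)."""
    return row.split("\t")


# A row with two fields; the second field contains an embedded tab.
original = ["42", "hello\tworld"]

round_tripped = decode_row_tab_delimited(encode_row_tab_delimited(original))

# The embedded tab is indistinguishable from a delimiter, so 3 fields come back.
print(len(original), len(round_tripped))  # 2 3

# Returning the row as a list of fields avoids the ambiguity entirely.
as_list = list(original)
print(as_list == original)  # True
```

Escaping tabs inside the string would be another option, but every consumer would then have to know the escaping rules; a list of fields needs no such convention.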
[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839391#action_12839391 ]

He Yongqiang commented on HIVE-259:
-----------------------------------

The code looks very good. Thanks for the code work, Jerome and Zheng!

Just some minor comments:
(1) I am not familiar with the exact definition of the percentile function. Must the result of percentile() be a member of the input data?
(2) HashMap and ArrayList are used to copy and sort. Can we use a TreeMap here? This is a small point and can be ignored.

At the beginning of the new test case, "DESCRIBE FUNCTION percentile; DESCRIBE FUNCTION EXTENDED percentile;" appears twice.

This is a very good function to have. It would be great if we could document its usage on the wiki page or somewhere.

Add PERCENTILE aggregate function
---------------------------------

                 Key: HIVE-259
                 URL: https://issues.apache.org/jira/browse/HIVE-259
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Venky Iyer
            Assignee: Jerome Boulon
         Attachments: HIVE-259-2.patch, HIVE-259-3.patch, HIVE-259.1.patch, HIVE-259.4.patch, HIVE-259.patch, jb2.txt, Percentile.xlsx

Compute at least the 25th, 50th, and 75th percentiles.
[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-259:
---------------------------

    Attachment: HIVE-259.5.patch

We take the method recommended by NIST. See http://en.wikipedia.org/wiki/Percentile#Alternative_methods
[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839393#action_12839393 ]

Zheng Shao commented on HIVE-259:
---------------------------------

> (1) I am not familiar with the exact definition of percentile function. Is the percentile()'s result must be a member of input data?

See the link above.

> (2) HashMap and ArrayList is used to copy and sort. Can we use tree map here? this is a small and can be ignored.

I think HashMap is better here. The reason is that the number of iterate() calls is usually much higher than the number of unique values (the size of the HashMap). By using HashMap we reduce the cost of each iterate() call.

> In the beginning of new test case, .. appears two times

Fixed in HIVE-259.5.patch.
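The trade-off argued above can be sketched in Python (used here only for illustration; the patch itself is Java). The sketch assumes values are counted in a hash map on every input row and the unique values are sorted once at the end. The names `iterate`, `terminate_sorted`, and `value_at_position` are hypothetical, and the lookup simply returns the k-th smallest value; it does not implement the percentile definition chosen in the patch.

```python
from collections import defaultdict


def iterate(counts, value):
    """Called once per input row: expected O(1) with a hash map."""
    counts[value] += 1


def terminate_sorted(counts):
    """Called once at the end: sort only the unique values."""
    return sorted(counts.items())


def value_at_position(sorted_counts, position):
    """Return the value at a zero-based position in the sorted multiset."""
    if position < 0:
        raise IndexError("position out of range")
    seen = 0
    for value, count in sorted_counts:
        seen += count
        if position < seen:
            return value
    raise IndexError("position out of range")


counts = defaultdict(int)
for v in [5, 1, 5, 3, 1, 5]:
    iterate(counts, v)

sorted_counts = terminate_sorted(counts)
print(sorted_counts)                        # [(1, 2), (3, 1), (5, 3)]
print(value_at_position(sorted_counts, 2))  # 3
```

With a sorted map, each per-row insert would cost O(log u) for u unique values; with the hash map, the O(u log u) sort is paid only once, which is the point made in the comment above.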
[jira] Commented: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839394#action_12839394 ]

He Yongqiang commented on HIVE-259:
-----------------------------------

Looks good, will test and commit.
[jira] Updated: (HIVE-259) Add PERCENTILE aggregate function
[ https://issues.apache.org/jira/browse/HIVE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-259:
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.6.0
     Release Note: Add PERCENTILE aggregate function
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed. Thanks for the hard work, Jerome Boulon and Zheng.

By the way, I manually fixed a show_function.q diff. Please document the usage of the percentile function on the wiki or somewhere.
Build failed in Hudson: Hive-trunk-h0.17 #375
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/375/changes Changes: [heyongqiang] HIVE-259. Add PERCENTILE aggregate function.(Jerome Boulon, Zheng via He Yongqiang) -- [...truncated 10891 lines...] [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table src [junit] POSTHOOK: Output: defa...@src [junit] OK [junit] Loading data to table src1 [junit] POSTHOOK: Output: defa...@src1 [junit] OK [junit] Loading data to table src_sequencefile [junit] POSTHOOK: Output: defa...@src_sequencefile [junit] OK [junit] Loading data to table src_thrift [junit] POSTHOOK: Output: defa...@src_thrift [junit] OK [junit] Loading data to table src_json [junit] POSTHOOK: Output: defa...@src_json [junit] OK [junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out [junit] Done query: unknown_function4.q [junit] Begin query: unknown_table1.q [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12 [junit] OK [junit] 
POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table src [junit] POSTHOOK: Output: defa...@src [junit] OK [junit] Loading data to table src1 [junit] POSTHOOK: Output: defa...@src1 [junit] OK [junit] Loading data to table src_sequencefile [junit] POSTHOOK: Output: defa...@src_sequencefile [junit] OK [junit] Loading data to table src_thrift [junit] POSTHOOK: Output: defa...@src_thrift [junit] OK [junit] Loading data to table src_json [junit] POSTHOOK: Output: defa...@src_json [junit] OK [junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_table1.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_table1.q.out [junit] Done query: unknown_table1.q [junit] Begin query: unknown_table2.q [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11} [junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11 [junit] OK [junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12} [junit] POSTHOOK: 
Output: defa...@srcpart@ds=2008-04-09/hr=12 [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] Loading data to table srcbucket [junit] POSTHOOK: Output: defa...@srcbucket [junit] OK [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output: defa...@srcbucket2 [junit] OK [junit] Loading data to table srcbucket2 [junit] POSTHOOK: Output:
[jira] Commented: (HIVE-1203) HiveInputFormat.getInputFormatFromCache swallows cause exception when trowing IOExcpetion
[ https://issues.apache.org/jira/browse/HIVE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839427#action_12839427 ]

Vladimir Klimontovich commented on HIVE-1203:
---------------------------------------------

Well, I can't assign this issue. It seems I don't have sufficient permissions.

HiveInputFormat.getInputFormatFromCache swallows cause exception when trowing IOExcpetion
-----------------------------------------------------------------------------------------

                 Key: HIVE-1203
                 URL: https://issues.apache.org/jira/browse/HIVE-1203
             Project: Hadoop Hive
          Issue Type: Bug
    Affects Versions: 0.4.0, 0.4.1, 0.5.0
            Reporter: Vladimir Klimontovich
             Fix For: 0.4.2, 0.5.1, 0.6.0
         Attachments: 0.4.patch, 0.5.patch, trunk.patch

To fix this, we simply need to pass the cause as a second parameter to the IOException constructor. Patches for 0.4, 0.5, and trunk are available.
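The idea behind the fix can be sketched in Python, where `raise ... from e` plays the role that passing the cause to the exception constructor plays in Java. This is an analogy only, not the attached patch; `load_input_format` and its `registry` argument are hypothetical names.

```python
def load_input_format(name, registry):
    """Look up an input format; wrap failures without losing the cause."""
    try:
        return registry[name]
    except KeyError as e:
        # "from e" keeps the original exception attached as __cause__,
        # so the underlying failure still shows up in the traceback.
        raise IOError(f"cannot create input format {name!r}") from e


try:
    load_input_format("missing", {})
except IOError as err:
    print(type(err.__cause__).__name__)  # KeyError
```

Without the chained cause, only the wrapper's message survives, which is the symptom described in the issue.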
[jira] Commented: (HIVE-600) Running TPC-H queries on Hive
[ https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839478#action_12839478 ]

Kamil Bajda-Pawlikowski commented on HIVE-600:
----------------------------------------------

Hi Yuntao,

I have attempted to run TPC-H on Hive. Thanks for the really well prepared scripts!

During the first query, I realized that things were not going well. It seems that Aaron's concern about the number of reducers was a valid one. However, the problem is that Hive schedules too many reducers!

The default configuration of Hive tries to determine the number of reduce tasks automatically, using the value of the hive.exec.reducers.bytes.per.reducer property (the default setting is one reduce task per 1GB of input data). When the data is huge, this is inefficient. This needs to be capped! For example, in my case there is 50GB of data per node but only 2 reduce task slots, so I get 25 waves of reduce tasks. Q1 ran for 1h49min. In contrast, when I set the hive.exec.reducers.max property to the number of reduce slots in my Hadoop installation, the query running time is only about 23min. Of note, the default value for hive.exec.reducers.max is 999.

The above issue was not too bad for the data size you used. A TPC-H dataset with SF=100 translates into at most 100 reducers per job, and with 40 reduce slots in total, each job had at most 2.5 waves of reduce tasks. Still, your numbers could be somewhat better by capping hive.exec.reducers.max at 40, per Tom White's tip #9 from http://www.cloudera.com/blog/2009/05/10-mapreduce-tips.

Could you please confirm whether my understanding is correct?

Thank you,
Kamil

Running TPC-H queries on Hive
-----------------------------

                 Key: HIVE-600
                 URL: https://issues.apache.org/jira/browse/HIVE-600
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Yuntao Jia
            Assignee: Yuntao Jia
         Attachments: TPC-H_on_Hive_2009-08-11.pdf, TPC-H_on_Hive_2009-08-11.tar.gz, TPC-H_on_Hive_2009-08-14.tar.gz

The goal is to run all TPC-H (http://www.tpc.org/tpch/) benchmark queries on Hive for two reasons. First, through those queries, we would like to find the new features that we need to put into Hive so that Hive supports common SQL queries. Second, we would like to measure the performance of Hive to find out what Hive is not good at. We can then improve Hive based on that information.
For queries that are not currently supported in Hive, I will try to rewrite them as one or more Hive-supported queries.
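The arithmetic in the comment above can be checked with a small Python sketch. It is a simplified model, not Hive's actual estimator: it assumes the reducer count is the input size divided by the bytes-per-reducer setting, rounded up and capped by the maximum, and that 1GB means 10^9 bytes. The function names are hypothetical.

```python
def reducers_by_size(input_bytes, bytes_per_reducer, max_reducers):
    """Estimate the reducer count from input size, capped at max_reducers."""
    estimate = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, estimate))


def reduce_waves(num_reducers, reduce_slots):
    """Waves of reduce tasks (fractional, as quoted in the comment)."""
    return num_reducers / reduce_slots


GB = 10**9  # assumption: decimal gigabytes

# 50GB per node, 2 reduce slots, default cap of 999 reducers.
default_reducers = reducers_by_size(50 * GB, 1 * GB, 999)
print(default_reducers, reduce_waves(default_reducers, 2))  # 50 25.0

# Same data with the cap set to the number of reduce slots.
capped_reducers = reducers_by_size(50 * GB, 1 * GB, 2)
print(capped_reducers, reduce_waves(capped_reducers, 2))    # 2 1.0

# SF=100 (taken here as roughly 100GB) with 40 reduce slots in total.
sf100_reducers = reducers_by_size(100 * GB, 1 * GB, 999)
print(sf100_reducers, reduce_waves(sf100_reducers, 40))     # 100 2.5
```

In a Hive session the cap discussed above would be applied by setting hive.exec.reducers.max to the number of available reduce slots before running the query.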
[jira] Updated: (HIVE-1204) typedbytes: writing to stderr kills the mapper
[ https://issues.apache.org/jira/browse/HIVE-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1204:
------------------------------

      Resolution: Fixed
    Release Note: typedbytes: writing to stderr kills the mapper
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks Namit!

typedbytes: writing to stderr kills the mapper
----------------------------------------------

                 Key: HIVE-1204
                 URL: https://issues.apache.org/jira/browse/HIVE-1204
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: Namit Jain
             Fix For: 0.6.0
         Attachments: hive.1204.1.patch, hive.1204.2.patch
[jira] Commented: (HIVE-1202) Unknown exception : null while join
[ https://issues.apache.org/jira/browse/HIVE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839515#action_12839515 ]

Mafish commented on HIVE-1202:
------------------------------

@Yongqiang

I ran the query:
select a.name, b.* from classes a join classes b on a.name = b.number where a.name b.number

It passed. In this case, both tables are physical tables.

But when I changed one of them to a sub-query, the error occurred again:
select a.name, b.* from (select name from classes) a join classes b on a.name = b.number where a.name b.number ;

Please try this case.

Unknown exception : null while join
-----------------------------------

                 Key: HIVE-1202
                 URL: https://issues.apache.org/jira/browse/HIVE-1202
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.4.1
         Environment: hive-0.4.1
                      hadoop 0.19.1
            Reporter: Mafish
             Fix For: 0.4.1
         Attachments: HIVE-1202.branch-0.4.1.patch

Hive throws "Unknown exception : null" with the query:
select * from ( select name from classes ) a join classes b where a.name b.number

After tracing the code, I found that this bug occurs under the following conditions:
1. It is a join operation.
2. At least one of the sources of the join is a physical table (the right side in the case above).
3. There is a where clause, and its condition(s) include columns from both sides of the join (a.name and b.number in this case).
[jira] Commented: (HIVE-1202) Unknown exception : null while join
[ https://issues.apache.org/jira/browse/HIVE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839546#action_12839546 ]

He Yongqiang commented on HIVE-1202:
------------------------------------

{quote}
But when I changed one of them to sub-query, error occured again, as:
select a.name, b.* from (select name from classes) a join classes b on a.name = b.number where a.name b.number ;
{quote}

What's the error for this query?
[jira] Commented: (HIVE-1202) Unknown exception : null while join
[ https://issues.apache.org/jira/browse/HIVE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839547#action_12839547 ]

Mafish commented on HIVE-1202:
------------------------------

The error message in Hive is the same as the issue title: "Unknown exception : null". The call stack is the same as in my first comment.
[jira] Commented: (HIVE-1202) Unknown exception : null while join
[ https://issues.apache.org/jira/browse/HIVE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839551#action_12839551 ]

He Yongqiang commented on HIVE-1202:
------------------------------------

I tried this query with the trunk code. It works fine.

hive> select a.name, b.* from (select name from classes) a join classes b on a.name = b.number where a.nameb.number;
Total MapReduce jobs = 1
[jira] Commented: (HIVE-1202) Unknown exception : null while join
[ https://issues.apache.org/jira/browse/HIVE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839552#action_12839552 ]

Mafish commented on HIVE-1202:
------------------------------

Which trunk are you using?
I'm using release 0.4.1, which is checked out from http://svn.apache.org/repos/asf/hadoop/hive/branches/branch-0.4

$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/hadoop/hive/branches/branch-0.4
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 916543
Node Kind: directory
Schedule: normal
Last Changed Author: nzhang
Last Changed Rev: 912061
Last Changed Date: 2010-02-20 09:44:44 +0800 (Sat, 20 Feb 2010)
[jira] Commented: (HIVE-1202) Unknown exception : null while join
[ https://issues.apache.org/jira/browse/HIVE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839555#action_12839555 ]

He Yongqiang commented on HIVE-1202:
------------------------------------

Please try http://svn.apache.org/repos/asf/hadoop/hive/trunk
or you can download the latest stable 0.5 version from http://hadoop.apache.org/hive/releases.html#Download
[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file
[ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839575#action_12839575 ]

He Yongqiang commented on HIVE-1197:
------------------------------------

Looks very good overall, congrats!

Just a few minor comments:
1. Can you change inputFormatClassName to use getter and setter methods?
2. Some code is duplicated from HiveInputFormat. Can we reuse it?
3. In BucketizedHiveRecordReader's next, I think we should remove the check of curReader == null. We should throw an exception if curReader == null, which means the reader has been closed.
4. I think we should remove line 207 in BucketizedHiveInputFormat: newjob.setInputFormat(inputFormat.getClass());
5. In HiveRecordReader:
5.1 Progress is calculated as (number of splits done) / (total split number). Can we make it more accurate? Let's say the work is evenly divided among all splits. Something like this: (number of splits done) / (total split number) + currReader.getProgress();
5.2 getPos should return currReader.getPos().

One more thing: do you think it is a good idea to let BucketizedHiveInputFormat extend HiveInputFormat? That way, the code would be clearer. And we should put the RecordReader and InputSplit in the same file as BucketizedHiveInputFormat.

create a new input format where a mapper spans a file
-----------------------------------------------------

                 Key: HIVE-1197
                 URL: https://issues.apache.org/jira/browse/HIVE-1197
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: Siying Dong
             Fix For: 0.6.0
         Attachments: hive.1197.1.patch

This will be needed for sort merge joins.
[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file
[ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839574#action_12839574 ]

Namit Jain commented on HIVE-1197:
----------------------------------

Overall, looks good. Some general comments:

Would it be a good idea to make BucketizedHiveInputFormat extend HiveInputFormat, and BucketizedHiveRecordReader extend HiveRecordReader? You won't have to copy a lot of code, and it would be easy to maintain.

For example, the check for ExecMapper in HiveRecordReader, and similar future optimizations, would be easier to maintain.
[jira] Commented: (HIVE-1197) create a new input format where a mapper spans a file
[ https://issues.apache.org/jira/browse/HIVE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839576#action_12839576 ]

He Yongqiang commented on HIVE-1197:
------------------------------------

Correction to 5.1: it should be
((number of splits done) + currReader.getProgress()) / (total split number)
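The corrected formula can be written out as a short Python sketch (illustration only; `combined_progress` is a hypothetical name, and it assumes the work is evenly divided among splits and that the current reader reports a fraction between 0.0 and 1.0).

```python
def combined_progress(splits_done, total_splits, current_reader_progress):
    """Overall progress across splits, per the corrected formula above."""
    if total_splits <= 0:
        return 1.0  # assumption: nothing to read counts as complete
    return (splits_done + current_reader_progress) / total_splits


# 2 of 4 splits finished, third split half read.
print(combined_progress(2, 4, 0.5))  # 0.625
```

For the same inputs, adding the reader's progress outside the division would give 2/4 + 0.5 = 1.0, which overstates progress; dividing the sum keeps the result within [0, 1].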
[jira] Updated: (HIVE-1194) sorted merge join
[ https://issues.apache.org/jira/browse/HIVE-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1194:
------------------------------

    Attachment: hive-1194-2010-02-28.patch

For early review only. I will test it more and add more test cases.

sorted merge join
-----------------

                 Key: HIVE-1194
                 URL: https://issues.apache.org/jira/browse/HIVE-1194
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: He Yongqiang
             Fix For: 0.6.0
         Attachments: hive-1194-2010-02-28.patch

If the input tables are sorted on the join key and a mapjoin is being performed, it is useful to exploit the sorted properties of the tables. This can lead to substantial CPU savings. It needs to work across bucketed map joins as well.
Since the sorted properties of a table are not currently enforced, a new parameter can be added to request the sort-merge join.
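For readers unfamiliar with the technique, here is a generic Python sketch of a sort-merge inner join over two inputs that are already sorted on the join key. It is a textbook illustration, not the attached patch, and `sort_merge_join` is a hypothetical name; rows are modelled as (key, value) pairs.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    result.append((lkey, lval, rval))
            i, j = i_end, j_end
    return result


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(sort_merge_join(left, right))
# [(2, 'b', 'x'), (2, 'b', 'y'), (2, 'c', 'x'), (2, 'c', 'y'), (4, 'd', 'w')]
```

Because both inputs are consumed in key order, neither side has to be loaded into a hash table, which is where CPU and memory savings of the kind mentioned in the issue would come from. The result is only correct if the inputs really are sorted, which is why the issue proposes an explicit parameter.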