Build failed in Hudson: Hive-trunk-h0.17 #257
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/257/
--
[...truncated 8569 lines...]
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function2.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function2.q.out
[junit] Done query: unknown_function2.q
[junit] Begin query: unknown_function3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function3.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function3.q.out
[junit] Done query: unknown_function3.q
[junit] Begin query: unknown_function4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out
[junit] Done query: unknown_function4.q
[junit] Begin query: unknown_table1.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading
Build failed in Hudson: Hive-trunk-h0.20 #83
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/83/
--
[...truncated 10676 lines...]
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function2.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function2.q.out
[junit] Done query: unknown_function2.q
[junit] Begin query: unknown_function3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function3.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function3.q.out
[junit] Done query: unknown_function3.q
[junit] Begin query: unknown_function4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out
[junit] Done query: unknown_function4.q
[junit] Begin query: unknown_table1.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output:
[jira] Created: (HIVE-906) split size should be increased for map-joins
split size should be increased for map-joins
--------------------------------------------

Key: HIVE-906
URL: https://issues.apache.org/jira/browse/HIVE-906
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

Had an offline discussion with Ning and Dhruba on this. It would be good to have a larger split size for the big table in map-joins. It can be a function of the size of the small table.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
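The heuristic proposed above could look something like the following sketch. The default split size, growth factor, and cap are illustrative assumptions for this digest, not values from Hive.

```java
// Hypothetical heuristic (not Hive's actual implementation): grow the big
// table's split size as a function of the small table's size, so that fewer
// mappers each pay the cost of loading the small table.
public class MapJoinSplitSize {
    static final long DEFAULT_SPLIT = 64L * 1024 * 1024;   // 64 MB, assumed default
    static final long MAX_SPLIT     = 1024L * 1024 * 1024; // cap at 1 GB

    // The bigger the small table each mapper must load, the fewer mappers we
    // want: grow the split size proportionally (factor 4 is illustrative).
    static long splitSizeFor(long smallTableBytes) {
        long grown = DEFAULT_SPLIT + 4 * smallTableBytes;
        return Math.min(grown, MAX_SPLIT);
    }
}
```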
[jira] Assigned: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HIVE-840:
-------------------------------

Assignee: He Yongqiang

Looks good - just one minor comment. Can you add the error message in ErrorMsg.java and then use it?

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771544#action_12771544 ]

Ning Zhang commented on HIVE-900:
---------------------------------

The essential problem is that too many mappers try to access the same block at the same time, exceeding the threshold on simultaneous accesses to a block; thus the BlockMissingException is thrown. Discussed with Namit and Dhruba offline. These are the proposed solutions:

1) Make HDFS fault tolerant to this issue. Dhruba mentioned there is already retry logic implemented in the DFS client code: if a BlockMissingException is thrown, the client waits about 400 ms and retries; if there are still exceptions, it waits 800 ms, and so on, up to 5 unsuccessful retries. This mechanism works for non-correlated simultaneous requests for the same block. However, in this case almost all the mappers request the same block at the same time, so their retries will also happen at about the same time. So it would be better to introduce a random factor into the wait time. Dhruba will look into the DFS code and work on that. This will solve a broader class of issues besides map-side join.

2) Another orthogonal issue brought up by Namit for map-side join is that if there are too many mappers and each of them requests the same small table, there is a cost of transferring the small file to all these mappers. Even if the BlockMissingException is resolved, that cost is still there, and it is proportional to the number of mappers. In this respect it would be better to reduce the number of mappers. But that also comes with a cost: each mapper then has to deal with a larger portion of the large table. So we have to trade off the network cost of the small table against the processing cost of the large table. Will come up with a heuristic to tune the parameters that decide the number of mappers for map join.

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
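The randomized retry wait discussed in this thread (about 400 ms, doubling per attempt, up to 5 retries, plus a random factor) can be sketched as follows. The 50% jitter fraction is an assumption for illustration, not the DFS client's actual behavior.

```java
import java.util.Random;

// Sketch of exponential backoff with jitter: correlated mappers retrying the
// same block get spread out in time instead of all waking up together.
public class JitteredBackoff {
    static final long BASE_WAIT_MS = 400; // first retry wait described above
    static final int MAX_RETRIES = 5;     // give up after 5 unsuccessful retries

    // Wait before retry number 'attempt' (0-based): 400 ms doubled each time,
    // plus up to 50% random jitter.
    static long waitMillis(int attempt, Random rng) {
        long base = BASE_WAIT_MS << attempt; // 400, 800, 1600, ...
        long jitter = (long) (rng.nextDouble() * base / 2);
        return base + jitter;
    }
}
```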
[jira] Created: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
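The fix the report implies is a null guard before the message lookup. A minimal sketch follows; the fallback SQLState and the lookup table are illustrative stand-ins, not the actual ErrorMsg.findSQLState code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: look up a SQLState by message prefix, falling back to a generic
// state instead of dereferencing a null message (the reported NPE).
public class SqlStateLookup {
    static final String GENERIC_SQLSTATE = "42000"; // assumed generic fallback
    static final Map<String, String> PREFIX_TO_STATE = new LinkedHashMap<>();
    static {
        PREFIX_TO_STATE.put("Table already exists", "42S01"); // illustrative entry only
    }

    static String findSQLState(String mesg) {
        if (mesg == null) {          // the guard the bug report calls for:
            return GENERIC_SQLSTATE; // a null message previously caused the NPE
        }
        for (Map.Entry<String, String> e : PREFIX_TO_STATE.entrySet()) {
            if (mesg.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return GENERIC_SQLSTATE;
    }
}
```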
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771551#action_12771551 ]

Prasad Chakka commented on HIVE-900:
------------------------------------

Just an off-the-wall idea: temporarily increase the replication factor for this block so that it is available in more racks, thus reducing the network cost and also avoiding the BlockMissingException. Of course, we need to find a way to reliably set the replication factor back to the original setting.

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771553#action_12771553 ]

Prasad Chakka commented on HIVE-900:
------------------------------------

@venky, maybe you can unblock your work by manually increasing the replication factor to something very high and then issuing the query?

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
[ https://issues.apache.org/jira/browse/HIVE-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang reassigned HIVE-907:
-------------------------------

Assignee: Ning Zhang

NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao
Assignee: Ning Zhang

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771562#action_12771562 ]

He Yongqiang commented on HIVE-840:
-----------------------------------

Thanks, Namit. You mean the error message: column alias + name + "already exists"? We need a parameter 'name' in the error message, so it may not fit in ErrorMsg.

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771569#action_12771569 ]

Ning Zhang commented on HIVE-900:
---------------------------------

@prasad, yes, that's definitely a good idea for scaling out mapjoin with a large number of mappers. Dhruba also suggested increasing the replication factor for the small file. But as you mentioned, we need to revert the replication factor before the mapjoin finishes or when any exception is caught. I'll also investigate that.

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771573#action_12771573 ]

Namit Jain commented on HIVE-840:
---------------------------------

There are existing error messages which are used with parameters: look at TABLE_ALREADY_EXISTS. Can't you follow the same approach? The only restriction is that the column name will appear at the end, which might be acceptable.

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
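The check under discussion amounts to scanning the output aliases and failing on the first repeat, with the offending name appended at the end of a parameterized message, as with TABLE_ALREADY_EXISTS. A sketch, with an illustrative message text and method name rather than the actual patch:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a duplicate-output-alias check for SELECT ... AS foo, foo, foo.
public class DuplicateAliasCheck {
    // Returns an error message naming the first duplicated alias, or null if
    // all output column names are distinct. Hive treats column names
    // case-insensitively, hence the toLowerCase().
    static String checkAliases(List<String> aliases) {
        Set<String> seen = new HashSet<>();
        for (String a : aliases) {
            if (!seen.add(a.toLowerCase())) {
                // the parameter lands at the end of the message, as noted above
                return "Ambiguous column reference: " + a;
            }
        }
        return null;
    }
}
```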
[jira] Commented: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
[ https://issues.apache.org/jira/browse/HIVE-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771582#action_12771582 ]

Bill Graham commented on HIVE-907:
----------------------------------

Unless you're already working on this, I have a patch that I can submit. Just let me know. As I'm sure you're aware, it's a quick fix.

NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao
Assignee: Ning Zhang

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
[ https://issues.apache.org/jira/browse/HIVE-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771590#action_12771590 ]

Ning Zhang commented on HIVE-907:
---------------------------------

Great! I have not really started yet. Please do upload your patch.

Ning

NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao
Assignee: Ning Zhang

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-908) optimize limit
optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
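Option 2 above can be illustrated with an AtomicLong standing in for the shared job counter. In real MapReduce, counters are aggregated with lag, so this shows only the idea, not a working distributed implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration of counter-based early exit for LIMIT: mappers stop
// emitting once 'limit' rows exist globally, instead of each producing
// 'limit' rows on its own.
public class LimitCounter {
    final AtomicLong emitted = new AtomicLong(); // stand-in for a job counter
    final long limit;

    LimitCounter(long limit) { this.limit = limit; }

    // Returns true if the row may be emitted; false once the limit is reached,
    // at which point the mapper can exit early.
    boolean tryEmit() {
        return emitted.incrementAndGet() <= limit;
    }
}
```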
[jira] Updated: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-840:
------------------------------

Attachment: hive-840-2009-10-29.patch

Integrated Namit's suggestions. Thanks, Namit!

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch, hive-840-2009-10-29.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771624#action_12771624 ]

He Yongqiang commented on HIVE-908:
-----------------------------------

I have seen this before, so I guess we may already have one ticket. Will try to find out.

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771626#action_12771626 ]

He Yongqiang commented on HIVE-908:
-----------------------------------

HIVE-588. Is this issue the same as HIVE-588?

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771638#action_12771638 ]

Ning Zhang commented on HIVE-908:
---------------------------------

I agree that for most cases, if the limit number is small, we should reduce the number of mappers by increasing the split size. This is particularly true when the limit can be pushed down to the TableScan operator. However, if the query has joins or group-bys, it could be more complicated. I think a more general solution would be to introduce a limit operator and a set of rewrite rules to push the limit operator down as far as possible. In the case of reduce-side joins and group-by, we cannot push the limit operator down to the map side; it has to stay on the reduce side. There are techniques that make join and group-by limit-aware in the top-k query processing literature (the ranking function for limit is just a constant function). A survey can be found at http://www.cs.uwaterloo.ca/~ilyas/papers/IlyasTopkSurvey.pdf.

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
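The rewrite-rule idea in this thread can be sketched as a per-operator predicate: the limit may move below row-preserving map-side operators but must stay above joins and group-bys. The operator names here are illustrative, not Hive's actual plan classes.

```java
// Sketch of a limit-pushdown eligibility rule over a simplified operator set.
public class LimitPushdown {
    enum Op { TABLE_SCAN, SELECT, FILTER, JOIN, GROUP_BY }

    // True if a LIMIT sitting directly above 'op' may be pushed below it
    // without changing the query result.
    static boolean canPushBelow(Op op) {
        switch (op) {
            case TABLE_SCAN:
            case SELECT:
                return true;  // row-preserving, map-side
            case FILTER:
                return false; // limiting before the filter could drop qualifying rows
            case JOIN:
            case GROUP_BY:
                return false; // results only materialize on the reduce side
            default:
                return false;
        }
    }
}
```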
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771645#action_12771645 ]

Namit Jain commented on HIVE-908:
---------------------------------

In general, if the limit is happening at the reducer, it is not much of a problem, since the number of reducers is usually not that large. There is already a limit operator - we can work on pushing it up as well, but both these approaches seem independent.

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-840.
-----------------------------

Resolution: Fixed
Fix Version/s: 0.5.0
Hadoop Flags: [Reviewed]

Committed. Thanks Yongqiang

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Fix For: 0.5.0
Attachments: hive-840-2009-10-28.patch, hive-840-2009-10-29.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.