Build failed in Hudson: Hive-trunk-h0.17 #257
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/257/
--
[...truncated 8569 lines...]
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function2.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function2.q.out
[junit] Done query: unknown_function2.q
[junit] Begin query: unknown_function3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function3.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function3.q.out
[junit] Done query: unknown_function3.q
[junit] Begin query: unknown_function4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.17/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out
[junit] Done query: unknown_function4.q
[junit] Begin query: unknown_table1.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading
Build failed in Hudson: Hive-trunk-h0.20 #83
See http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/83/
--
[...truncated 10676 lines...]
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function2.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function2.q.out
[junit] Done query: unknown_function2.q
[junit] Begin query: unknown_function3.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function3.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function3.q.out
[junit] Done query: unknown_function3.q
[junit] Begin query: unknown_function4.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table srcbucket
[junit] POSTHOOK: Output: defa...@srcbucket
[junit] OK
[junit] Loading data to table src
[junit] POSTHOOK: Output: defa...@src
[junit] OK
[junit] Loading data to table src1
[junit] POSTHOOK: Output: defa...@src1
[junit] OK
[junit] Loading data to table src_sequencefile
[junit] POSTHOOK: Output: defa...@src_sequencefile
[junit] OK
[junit] Loading data to table src_thrift
[junit] POSTHOOK: Output: defa...@src_thrift
[junit] OK
[junit] Loading data to table src_json
[junit] POSTHOOK: Output: defa...@src_json
[junit] OK
[junit] diff http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/build/ql/test/logs/negative/unknown_function4.q.out http://hudson.zones.apache.org/hudson/job/Hive-trunk-h0.20/ws/hive/ql/src/test/results/compiler/errors/unknown_function4.q.out
[junit] Done query: unknown_function4.q
[junit] Begin query: unknown_table1.q
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-08, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-08/hr=12
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=11}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=11
[junit] OK
[junit] Loading data to table srcpart partition {ds=2008-04-09, hr=12}
[junit] POSTHOOK: Output: defa...@srcpart@ds=2008-04-09/hr=12
[junit] OK
[junit] POSTHOOK: Output:
[jira] Created: (HIVE-906) split size should be increased for map-joins
split size should be increased for map-joins
--------------------------------------------

Key: HIVE-906
URL: https://issues.apache.org/jira/browse/HIVE-906
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

Had an offline discussion with Ning and Dhruba on this. It would be good to have a larger split size for the big table in map-joins. It can be a function of the size of the small table.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
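The heuristic proposed above could look something like the following sketch. The default split size, growth factor, and cap are illustrative assumptions for this digest, not values from Hive.

```java
// Hypothetical heuristic (not Hive's actual implementation): grow the big
// table's split size as a function of the small table's size, so that fewer
// mappers each pay the cost of loading the small table.
public class MapJoinSplitSize {
    static final long DEFAULT_SPLIT = 64L * 1024 * 1024;   // 64 MB, assumed default
    static final long MAX_SPLIT     = 1024L * 1024 * 1024; // cap at 1 GB

    // The bigger the small table each mapper must load, the fewer mappers we
    // want: grow the split size proportionally (factor 4 is illustrative).
    static long splitSizeFor(long smallTableBytes) {
        long grown = DEFAULT_SPLIT + 4 * smallTableBytes;
        return Math.min(grown, MAX_SPLIT);
    }
}
```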
[jira] Assigned: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HIVE-840:
-------------------------------

Assignee: He Yongqiang

Looks good - just one minor comment. Can you add the error message in ErrorMsg.java and then use it?

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771544#action_12771544 ]

Ning Zhang commented on HIVE-900:
---------------------------------

The essential problem is that too many mappers try to access the same block at the same time, exceeding the threshold on simultaneous accesses to a block; thus the BlockMissingException is thrown. Discussed with Namit and Dhruba offline. These are the proposed solutions:

1) Make HDFS fault tolerant to this issue. Dhruba mentioned there is already retry logic implemented in the DFS client code: if a BlockMissingException is thrown, the client waits about 400 ms and retries; if there are still exceptions, it waits 800 ms, and so on, up to 5 unsuccessful retries. This mechanism works for non-correlated simultaneous requests for the same block. However, in this case almost all the mappers request the same block at the same time, so their retries will also happen at about the same time. So it would be better to introduce a random factor into the wait time. Dhruba will look into the DFS code and work on that. This will solve a broader class of issues besides map-side join.

2) Another orthogonal issue brought up by Namit for map-side join is that if there are too many mappers and each of them requests the same small table, there is a cost of transferring the small file to all these mappers. Even if the BlockMissingException is resolved, that cost is still there, and it is proportional to the number of mappers. In this respect it would be better to reduce the number of mappers. But that also comes with a cost: each mapper then has to deal with a larger portion of the large table. So we have to trade off the network cost of the small table against the processing cost of the large table. Will come up with a heuristic to tune the parameters that decide the number of mappers for map join.

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
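The randomized retry wait discussed in this thread (about 400 ms, doubling per attempt, up to 5 retries, plus a random factor) can be sketched as follows. The 50% jitter fraction is an assumption for illustration, not the DFS client's actual behavior.

```java
import java.util.Random;

// Sketch of exponential backoff with jitter: correlated mappers retrying the
// same block get spread out in time instead of all waking up together.
public class JitteredBackoff {
    static final long BASE_WAIT_MS = 400; // first retry wait described above
    static final int MAX_RETRIES = 5;     // give up after 5 unsuccessful retries

    // Wait before retry number 'attempt' (0-based): 400 ms doubled each time,
    // plus up to 50% random jitter.
    static long waitMillis(int attempt, Random rng) {
        long base = BASE_WAIT_MS << attempt; // 400, 800, 1600, ...
        long jitter = (long) (rng.nextDouble() * base / 2);
        return base + jitter;
    }
}
```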
[jira] Created: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
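The fix the report implies is a null guard before the message lookup. A minimal sketch follows; the fallback SQLState and the lookup table are illustrative stand-ins, not the actual ErrorMsg.findSQLState code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: look up a SQLState by message prefix, falling back to a generic
// state instead of dereferencing a null message (the reported NPE).
public class SqlStateLookup {
    static final String GENERIC_SQLSTATE = "42000"; // assumed generic fallback
    static final Map<String, String> PREFIX_TO_STATE = new LinkedHashMap<>();
    static {
        PREFIX_TO_STATE.put("Table already exists", "42S01"); // illustrative entry only
    }

    static String findSQLState(String mesg) {
        if (mesg == null) {          // the guard the bug report calls for:
            return GENERIC_SQLSTATE; // a null message previously caused the NPE
        }
        for (Map.Entry<String, String> e : PREFIX_TO_STATE.entrySet()) {
            if (mesg.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return GENERIC_SQLSTATE;
    }
}
```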
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771551#action_12771551 ]

Prasad Chakka commented on HIVE-900:
------------------------------------

Just an off-the-wall idea: temporarily increase the replication factor for this block so that it is available in more racks, thus reducing the network cost and also avoiding the BlockMissingException. Of course, we need to find a way to reliably set the replication factor back to the original setting.

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771553#action_12771553 ]

Prasad Chakka commented on HIVE-900:
------------------------------------

@venky, maybe you can unblock your work by manually increasing the replication factor to something very high and then issuing the query?

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
[ https://issues.apache.org/jira/browse/HIVE-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang reassigned HIVE-907:
-------------------------------

Assignee: Ning Zhang

NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao
Assignee: Ning Zhang

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771562#action_12771562 ]

He Yongqiang commented on HIVE-840:
-----------------------------------

Thanks, Namit. You mean the error message: column alias + name + "already exists"? We need a parameter 'name' in the error message, so it may not fit in ErrorMsg.

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-900) Map-side join failed if there are large number of mappers
[ https://issues.apache.org/jira/browse/HIVE-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771569#action_12771569 ]

Ning Zhang commented on HIVE-900:
---------------------------------

@prasad, yes, that's definitely a good idea for scaling out mapjoin with a large number of mappers. Dhruba also suggested increasing the replication factor for the small file. But as you mentioned, we need to revert the replication factor before the mapjoin finishes or when any exception is caught. I'll also investigate that.

Map-side join failed if there are large number of mappers
---------------------------------------------------------

Key: HIVE-900
URL: https://issues.apache.org/jira/browse/HIVE-900
Project: Hadoop Hive
Issue Type: Improvement
Reporter: Ning Zhang
Assignee: Ning Zhang

Map-side join is efficient when joining a huge table with a small table, since each mapper can read the small table into main memory and do the join locally. However, if too many mappers are generated for the map join, a large number of mappers will simultaneously send requests to read the same block of the small table. Currently Hadoop has an upper limit on the number of simultaneous requests for the same block (250?). If that is reached, a BlockMissingException is thrown, which causes a lot of mappers to be killed. Retrying doesn't solve the problem but worsens it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771573#action_12771573 ]

Namit Jain commented on HIVE-840:
---------------------------------

There are existing error messages which are used with parameters: look at TABLE_ALREADY_EXISTS. Can't you follow the same approach? The only restriction is that the column name will appear at the end, which might be acceptable.

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
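The check under discussion amounts to scanning the output aliases and failing on the first repeat, with the offending name appended at the end of a parameterized message, as with TABLE_ALREADY_EXISTS. A sketch, with an illustrative message text and method name rather than the actual patch:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a duplicate-output-alias check for SELECT ... AS foo, foo, foo.
public class DuplicateAliasCheck {
    // Returns an error message naming the first duplicated alias, or null if
    // all output column names are distinct. Hive treats column names
    // case-insensitively, hence the toLowerCase().
    static String checkAliases(List<String> aliases) {
        Set<String> seen = new HashSet<>();
        for (String a : aliases) {
            if (!seen.add(a.toLowerCase())) {
                // the parameter lands at the end of the message, as noted above
                return "Ambiguous column reference: " + a;
            }
        }
        return null;
    }
}
```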
[jira] Commented: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
[ https://issues.apache.org/jira/browse/HIVE-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771582#action_12771582 ]

Bill Graham commented on HIVE-907:
----------------------------------

Unless you're already working on this, I have a patch that I can submit. Just let me know. As I'm sure you're aware, it's a quick fix.

NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao
Assignee: Ning Zhang

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-907) NullPointerException in ErrorMsg.findSQLState
[ https://issues.apache.org/jira/browse/HIVE-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771590#action_12771590 ]

Ning Zhang commented on HIVE-907:
---------------------------------

Great! I have not really started yet. Please do upload your patch.

Ning

NullPointerException in ErrorMsg.findSQLState
---------------------------------------------

Key: HIVE-907
URL: https://issues.apache.org/jira/browse/HIVE-907
Project: Hadoop Hive
Issue Type: Bug
Reporter: Zheng Shao
Assignee: Ning Zhang

NullPointerException is thrown when the mesg is null. This happens if an exception is thrown earlier with a null message.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (HIVE-908) optimize limit
optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
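Option 2 above can be illustrated with an AtomicLong standing in for the shared job counter. In real MapReduce, counters are aggregated with lag, so this shows only the idea, not a working distributed implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy illustration of counter-based early exit for LIMIT: mappers stop
// emitting once 'limit' rows exist globally, instead of each producing
// 'limit' rows on its own.
public class LimitCounter {
    final AtomicLong emitted = new AtomicLong(); // stand-in for a job counter
    final long limit;

    LimitCounter(long limit) { this.limit = limit; }

    // Returns true if the row may be emitted; false once the limit is reached,
    // at which point the mapper can exit early.
    boolean tryEmit() {
        return emitted.incrementAndGet() <= limit;
    }
}
```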
[jira] Updated: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-840:
------------------------------

Attachment: hive-840-2009-10-29.patch

Integrated Namit's suggestions. Thanks, Namit!

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Attachments: hive-840-2009-10-28.patch, hive-840-2009-10-29.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771624#action_12771624 ]

He Yongqiang commented on HIVE-908:
-----------------------------------

I have seen this before, so I guess we may already have one ticket. Will try to find out.

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771626#action_12771626 ]

He Yongqiang commented on HIVE-908:
-----------------------------------

HIVE-588. Is this issue the same as HIVE-588?

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771638#action_12771638 ]

Ning Zhang commented on HIVE-908:
---------------------------------

I agree that for most cases, if the limit number is small, we should reduce the number of mappers by increasing the split size. This is particularly true when the limit can be pushed down to the TableScan operator. However, if the query has joins or group-bys, it could be more complicated. I think a more general solution would be to introduce a limit operator and a set of rewrite rules to push the limit operator down as far as possible. In the case of reduce-side joins and group-by, we cannot push the limit operator down to the map side; it has to stay on the reduce side. There are techniques that make join and group-by limit-aware in the top-k query processing literature (the ranking function for limit is just a constant function). A survey can be found at http://www.cs.uwaterloo.ca/~ilyas/papers/IlyasTopkSurvey.pdf.

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
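The rewrite-rule idea in this thread can be sketched as a per-operator predicate: the limit may move below row-preserving map-side operators but must stay above joins and group-bys. The operator names here are illustrative, not Hive's actual plan classes.

```java
// Sketch of a limit-pushdown eligibility rule over a simplified operator set.
public class LimitPushdown {
    enum Op { TABLE_SCAN, SELECT, FILTER, JOIN, GROUP_BY }

    // True if a LIMIT sitting directly above 'op' may be pushed below it
    // without changing the query result.
    static boolean canPushBelow(Op op) {
        switch (op) {
            case TABLE_SCAN:
            case SELECT:
                return true;  // row-preserving, map-side
            case FILTER:
                return false; // limiting before the filter could drop qualifying rows
            case JOIN:
            case GROUP_BY:
                return false; // results only materialize on the reduce side
            default:
                return false;
        }
    }
}
```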
[jira] Commented: (HIVE-908) optimize limit
[ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771645#action_12771645 ]

Namit Jain commented on HIVE-908:
---------------------------------

In general, if the limit is happening at the reducer, it is not much of a problem, since the number of reducers is usually not that large. There is already a limit operator - we can work on pushing it up as well, but both these approaches seem independent.

optimize limit
--------------

Key: HIVE-908
URL: https://issues.apache.org/jira/browse/HIVE-908
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.5.0

If there is a limit, all the mappers have to finish and create 'limit' number of rows - this can be pretty expensive for a large file. The following optimizations can be performed in this area:

1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows that there is a limit, so it might be useful to increase the split size, thereby reducing the number of mappers.
2. Maintain a counter for the total output rows - the mappers can look at that counter and decide to exit instead of each emitting 'limit' number of rows themselves.

Option 2 may lead to some problems because of bugs in counters, but option 1 should definitely help.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HIVE-840) no error if user specifies multiple columns of same name as output
[ https://issues.apache.org/jira/browse/HIVE-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-840.
-----------------------------

Resolution: Fixed
Fix Version/s: 0.5.0
Hadoop Flags: [Reviewed]

Committed. Thanks Yongqiang

no error if user specifies multiple columns of same name as output
------------------------------------------------------------------

Key: HIVE-840
URL: https://issues.apache.org/jira/browse/HIVE-840
Project: Hadoop Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: He Yongqiang
Fix For: 0.5.0
Attachments: hive-840-2009-10-28.patch, hive-840-2009-10-29.patch

INSERT OVERWRITE TABLE table_name_here SELECT TRANSFORM(key,val) USING '/script/' AS foo, foo, foo

The above query should fail, but it succeeds.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.