[jira] Commented: (HIVE-1641) add map joined table to distributed cache

2010-10-06 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918764#action_12918764
 ] 

He Yongqiang commented on HIVE-1641:


There are two patches with the same name. Can you delete the older one? Also, 
when uploading a patch, please name it 
hive-jiranumber.patchnumberordate.patch.


> add map joined table to distributed cache
> -
>
> Key: HIVE-1641
> URL: https://issues.apache.org/jira/browse/HIVE-1641
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 0.7.0
> Reporter: Namit Jain
> Assignee: Liyin Tang
> Fix For: 0.7.0
>
> Attachments: Hive-1641.patch, Hive-1641.patch
>
>
> Currently, the mappers read the map-joined table directly from HDFS, which 
> makes it difficult to scale.
> We end up getting lots of timeouts once the number of mappers goes beyond a 
> few thousand, due to the concurrent mappers.
> It would be a good idea to put the mapped file into the distributed cache 
> and read it from there instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1641) add map joined table to distributed cache

2010-10-05 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918374#action_12918374
 ] 

Liyin Tang commented on HIVE-1641:
--

The previous assumption is not always true: there might be multiple map join 
operations in one local work. 

No matter how many map join operators there are in one MapRedTask, each map 
join operator has one parent operator from the big table branch and its other 
parent operators from the small table branches.
The big table branch is left alone.

For each small table branch, create a new JDBMSinkOperator to replace the 
current MapJoinOperator. The local work then shares no operators with the 
MapredWork.  
Also create a JDBMDummyOperator to replace the original parent operator of the 
MapJoinOperator. 
This JDBMDummyOperator helps the MapJoinOperator generate the correct input 
object inspector at run time.
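
To make the rewrite above concrete, here is a rough sketch on a toy operator 
graph. SketchOp and splitSmallBranch are hypothetical names made up for this 
illustration and are not the classes in the patch; only the operator kinds 
(JDBMSink, JDBMDummy, MapJoin) come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy operator node; only parent/child links are modeled (hypothetical class).
class SketchOp {
    final String kind;
    final List<SketchOp> parents = new ArrayList<>();
    final List<SketchOp> children = new ArrayList<>();

    SketchOp(String kind) { this.kind = kind; }

    // Link this operator as a parent of `child`.
    void connect(SketchOp child) {
        children.add(child);
        child.parents.add(this);
    }
}

class MapJoinSplitSketch {
    // Cut the small-table branch at parent position `pos` away from the map join.
    static void splitSmallBranch(SketchOp mapJoin, int pos) {
        SketchOp smallParent = mapJoin.parents.get(pos);

        // The small-table branch now ends in a sink instead of the map join.
        SketchOp sink = new SketchOp("JDBMSink");
        smallParent.children.set(smallParent.children.indexOf(mapJoin), sink);
        sink.parents.add(smallParent);

        // The map join keeps a placeholder parent in the same position.
        SketchOp dummy = new SketchOp("JDBMDummy");
        dummy.children.add(mapJoin);
        mapJoin.parents.set(pos, dummy);
    }

    public static void main(String[] args) {
        SketchOp big = new SketchOp("BigTableScan");
        SketchOp small = new SketchOp("SmallTableScan");
        SketchOp join = new SketchOp("MapJoin");
        big.connect(join);
        small.connect(join);

        splitSmallBranch(join, 1);
        System.out.println(join.parents.get(1).kind);  // JDBMDummy
        System.out.println(small.children.get(0).kind); // JDBMSink
    }
}
```

After the split, nothing reachable from the small table scan is reachable 
from the map join any more, which is the "no common operators" property.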

At execution time, the LocalTask processes all the local work and generates 
a JDBM file for each small table. 
When the MapRedTask starts to process the first row for the MapJoinOperator, 
it loads the JDBM file to build the in-memory hash table.
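
The load-on-first-row behavior can be sketched roughly as below. 
LazyJoinSketch and its loader argument are hypothetical names for 
illustration; the loader stands in for reading the JDBM file, and the real 
operator works on rows and object inspectors rather than plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a map join operator that loads its table lazily.
class LazyJoinSketch {
    private final Supplier<Map<String, String>> loader; // stands in for reading the file
    private Map<String, String> table;                  // null until the first row arrives
    int loads = 0;                                      // counts how often the loader ran

    LazyJoinSketch(Supplier<Map<String, String>> loader) {
        this.loader = loader;
    }

    // Look up the small-table value for a big-table key; returns null if no match.
    String process(String key) {
        if (table == null) {
            table = loader.get(); // happens only on the first row
            loads++;
        }
        return table.get(key);
    }

    public static void main(String[] args) {
        LazyJoinSketch join = new LazyJoinSketch(() -> {
            Map<String, String> m = new HashMap<>();
            m.put("k1", "v1");
            return m;
        });
        System.out.println(join.process("k1")); // v1
        System.out.println(join.process("k2")); // null
        System.out.println(join.loads);         // 1
    }
}
```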

In local mode, the JDBM files are simply stored in a local directory. 
Otherwise, the JDBM files are added to the distributed cache.

This patch has only been tested in local mode. I will submit another patch 
after testing it on a cluster.






[jira] Commented: (HIVE-1641) add map joined table to distributed cache

2010-09-23 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914332#action_12914332
 ] 

Liyin Tang commented on HIVE-1641:
--

Right now, the local work is only for processing the small tables of a map 
join operation. Also, one MapredTask can have at most one map join operation, 
because if one map join is followed by another map join, they are split into 
two tasks. So one MapredTask has at most one local work to do. 

One feasible solution is to create a new type of task, named MapredLocalTask, 
which does some MapredLocalWork (local work). If a MapredTask has local work 
to do, create a new MapredLocalTask for this local work, make the current 
MapredTask depend on the newly generated task, and make the newly generated 
task depend on the parent tasks of the current task.
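
As a rough illustration of the dependency rewiring described above, here is 
a sketch on a toy task graph. SketchTask and insertLocalTask are hypothetical 
names invented for this example; they are not Hive's Task API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a task; only the dependency links are modeled (hypothetical class).
class SketchTask {
    final String name;
    final List<SketchTask> parents = new ArrayList<>();
    final List<SketchTask> children = new ArrayList<>();

    SketchTask(String name) { this.name = name; }

    // Record that `child` must run after this task.
    void addDependent(SketchTask child) {
        children.add(child);
        child.parents.add(this);
    }
}

class LocalTaskRewiring {
    // Insert a new local task between `current` and its former parents.
    static SketchTask insertLocalTask(SketchTask current) {
        SketchTask local = new SketchTask(current.name + "-local");
        // The new task takes over every parent of the current task.
        for (SketchTask parent : new ArrayList<>(current.parents)) {
            parent.children.remove(current);
            parent.addDependent(local);
        }
        current.parents.clear();
        // The current task now depends only on the new local task.
        local.addDependent(current);
        return local;
    }

    public static void main(String[] args) {
        SketchTask upstream = new SketchTask("stage-1");
        SketchTask mapred = new SketchTask("stage-2");
        upstream.addDependent(mapred);

        insertLocalTask(mapred);
        System.out.println(upstream.children.get(0).name); // stage-2-local
        System.out.println(mapred.parents.get(0).name);    // stage-2-local
    }
}
```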

The new MapredLocalTask does the local work only once and generates the 
mapped file (JDBM file). The next step is to put the newly generated mapped 
file into the distributed cache. All the mappers will 
read this file from the distributed cache and construct the in-memory hash 
table from it.
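
The write-once, load-in-every-mapper pattern could look roughly like the 
sketch below. It is only an illustration under stated assumptions: plain Java 
serialization stands in for the JDBM file format, a temp file stands in for 
the copy each mapper would get from the distributed cache, and 
SmallTableFileSketch, writeTable and loadTable are hypothetical names.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class SmallTableFileSketch {
    // Done once by the local task: dump the small table's key/value pairs to a file.
    static void writeTable(Map<String, String> table, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new HashMap<>(table));
        }
    }

    // Done by every mapper: rebuild the in-memory hash table from its copy of the file.
    @SuppressWarnings("unchecked")
    static Map<String, String> loadTable(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("small-table", ".bin");
        Map<String, String> small = new HashMap<>();
        small.put("k1", "v1");
        writeTable(small, file);

        // Two simulated mappers, each loading the same file independently.
        System.out.println(loadTable(file).get("k1")); // v1
        System.out.println(loadTable(file).get("k1")); // v1
        Files.delete(file);
    }
}
```

The point of the design is that the small table is scanned and hashed once, 
and the mappers only pay for a local file read.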

Any comments are welcome :)


