[
https://issues.apache.org/jira/browse/TAJO-982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260913#comment-14260913
]
Jihoon Son edited comment on TAJO-982 at 12/30/14 9:07 AM:
-----------------------------------------------------------
Hi guys, I have two ideas for this issue.
* When writing intermediate data for shuffle, we can merge small files into
larger ones. I don't think this is feasible, because the merge would have to
take task assignment into account, which forces static task assignment.
* As described in this issue, we can improve Fetchers to get multiple files in
a single request. This approach raises a further question about the
transmission protocol, for which I'm considering two options:
** Keep using HTTP as in the current implementation, but improve the Fetchers
and PullServers to handle an HTTP request for multiple files. For example, a
Fetcher can request a virtual HTTP address that indicates multiple files. A
PullServer that receives the request can extract the real file names from the
virtual address, dynamically merge those files into a single file, and send
it.
** Use an alternative transmission protocol that natively supports
transmitting multiple files in a single request.
I think the last one is the best approach, but I still don't have much
background on it.
What do you think of these approaches?
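
To make the HTTP option above more concrete, here is a minimal sketch of the
virtual-address idea. Everything in it is an assumption for illustration only:
the `/multi` path, the `files` query parameter, the helper names, and the
4-byte length prefix used to frame each file in the merged payload are all
hypothetical and are not part of Tajo's Fetcher or PullServer. It also assumes
file names contain no commas.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_virtual_url(host, port, file_names):
    """Fetcher side: encode several shuffle output names into one URL.

    The '/multi' path and 'files' parameter are hypothetical names.
    """
    query = urlencode({"files": ",".join(file_names)})
    return f"http://{host}:{port}/multi?{query}"


def extract_file_names(url):
    """PullServer side: recover the real file names from the virtual URL."""
    params = parse_qs(urlsplit(url).query)
    return params["files"][0].split(",")


def merge_outputs(file_names, read_file):
    """PullServer side: merge the requested files into a single payload.

    Each file is prefixed with its length as a 4-byte big-endian integer
    (an assumed framing) so that the receiver can split the payload again.
    read_file is any callable mapping a file name to its bytes.
    """
    merged = bytearray()
    for name in file_names:
        data = read_file(name)
        merged += len(data).to_bytes(4, "big")
        merged += data
    return bytes(merged)


def split_outputs(payload):
    """Fetcher side: split a merged payload back into per-file chunks."""
    chunks = []
    pos = 0
    while pos < len(payload):
        size = int.from_bytes(payload[pos:pos + 4], "big")
        pos += 4
        chunks.append(payload[pos:pos + size])
        pos += size
    return chunks
```

With this shape, a Fetcher would issue one request per PullServer instead of
one per file, and the existing HTTP transport would stay in place. A real
implementation would stream the files rather than buffer them in memory.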
> Improve Fetcher to get multiple shuffle outputs through a request
> -----------------------------------------------------------------
>
> Key: TAJO-982
> URL: https://issues.apache.org/jira/browse/TAJO-982
> Project: Tajo
> Issue Type: Improvement
> Components: data shuffle
> Reporter: Hyunsik Choi
> Assignee: Jihoon Son
> Fix For: 0.10
>
>
> Currently, a Fetcher can request only one shuffle output at a time. This can
> cause performance degradation even when the intermediate data is actually
> small.
> For example, if the input data set of the first stage is big and the
> intermediate data is very small, QueryMaster will choose only a few nodes for
> the next execution block. Since each node keeps a limited number of fetch
> threads, it will take a long time for the nodes of the next stage to fetch
> all of the intermediate data.
> If a Fetcher could get multiple shuffle outputs through a single request, it
> would resolve the slowness that occurs in such cases.