[
https://issues.apache.org/jira/browse/TAJO-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915810#comment-13915810
]
Hyunsik Choi commented on TAJO-613:
-----------------------------------
+1 for this issue. We absolutely need a way to handle stragglers.
Fortunately, TAJO-589 is in progress. It enables QueryMaster to track the
progress of tasks, which allows QueryMaster to detect unexpectedly slow
tasks, as may occur in large clusters. I believe that we can implement
straggler handling after TAJO-589.
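
As a rough illustration of how per-task progress reports could drive
straggler detection, here is a minimal Python sketch. It is not Tajo code
and does not reflect the TAJO-589 design: TaskProgress, find_stragglers,
and the thresholds slowdown_factor and min_elapsed_sec are hypothetical
names and values chosen only for this example. It assumes each task
reports a completion fraction and its elapsed time, extrapolates a total
run time from those, and flags unfinished tasks whose projected time is
far above the median.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskProgress:
    """Hypothetical progress report for one task (not a Tajo class)."""
    task_id: str
    progress: float     # fraction completed, in [0.0, 1.0]
    elapsed_sec: float  # seconds since the task started


def estimated_total_sec(task):
    """Extrapolate total run time from the progress made so far."""
    if task.progress <= 0.0:
        return float("inf")  # no progress reported yet
    return task.elapsed_sec / task.progress


def find_stragglers(tasks, slowdown_factor=5.0, min_elapsed_sec=10.0):
    """Return ids of unfinished tasks projected to run far longer than the median."""
    if not tasks:
        return []
    # Median projected run time across all tasks, finished or not.
    typical = median(estimated_total_sec(t) for t in tasks)
    return [
        t.task_id
        for t in tasks
        if t.progress < 1.0                      # still running
        and t.elapsed_sec >= min_elapsed_sec     # ignore tasks that just started
        and estimated_total_sec(t) > slowdown_factor * typical
    ]
```

With three finished tasks that took about 3 seconds each and a fourth task
that is 2% done after 34 seconds, only the fourth is flagged. A task
flagged this way would be a candidate for a speculative duplicate attempt
on another worker.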
> Hedging against unusually slow TajoWorker
> -----------------------------------------
>
> Key: TAJO-613
> URL: https://issues.apache.org/jira/browse/TAJO-613
> Project: Tajo
> Issue Type: Improvement
> Reporter: Keuntae Park
>
> When one of the disks in my Tajo cluster becomes unhealthy (that is, its
> response time becomes slow due to a hardware problem), query processing
> becomes extremely slow.
> The following is the kernel log of the server that has the unhealthy disk:
> {noformat}
> Feb 18 15:20:12 ceo-tajo03 kernel: sd 0:2:4:0: [sde] Unhandled error code
> Feb 18 15:20:12 ceo-tajo03 kernel: sd 0:2:4:0: [sde] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
> Feb 18 15:20:12 ceo-tajo03 kernel: sd 0:2:4:0: [sde] CDB: Read(16): 88 00 00 00 00 01 57 ec 66 32 00 00 01 00 00 00
> ...
> {noformat}
> Because of this problem, a TaskRunner that normally takes less than 3 seconds
> for the given query takes 1700 seconds, and the total query execution time
> becomes 1750 seconds, up from the usual 70 seconds.
> I think Tajo needs a mechanism like the speculative execution in MapReduce.