[
https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253308#comment-13253308
]
Thomas Jungblut commented on HAMA-540:
--------------------------------------
bq.1) It would be good to have the initial (p-1) pivots evenly distributed over
the input data. This way the processors take almost same time to complete the
task. How are you planning to get them?
Yes that is the major point in this algorithm, I use the same naive way to get
the pivots. There is this median in linear time algorithm from Tarjan and
Rivest (
http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
.
However noone is going to iterative over 1TB of data to determine a good pivot.
When you have big data the chances that the data is evenly distributed over the
processes by picking the first n-elements is quite high.
bq. 2) Slightly less than double the size of the input data is passed between
the processors in the sampling sort. If the input is 1TB, then ~2TB is
transferred between the nodes. Not sure if this is OK? It would be nice to know
how much data is transferred in merge sort and quick sort.
That is the key why I think this is not very efficient.
bq. I am just getting a feel of Hama/BSP and thought of implementing sampling
sort. Please do it. I will pick the next easy one
I'm sorry! You can also implement sampling sort, maybe you interpret the paper
other than me.
> Create distributed sort BSP
> ---------------------------
>
> Key: HAMA-540
> URL: https://issues.apache.org/jira/browse/HAMA-540
> Project: Hama
> Issue Type: New Feature
> Components: bsp, examples
> Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework, for various other tasks
> this could be as well practical.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira