[
https://issues.apache.org/jira/browse/HAMA-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253302#comment-13253302
]
Thomas Jungblut commented on HAMA-540:
--------------------------------------
Here's my first prototype:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/bsp/SamplingSort.java
I am really astonished that this works.
You can see that the pivoting is a bit naive, because the distribution is far
from even (see the "->" mapping between the log lines).
{noformat}
12/04/13 13:39:19 INFO bsp.FileInputFormat: Total input paths to process : 1
12/04/13 13:39:19 INFO bsp.FileInputFormat: Total # of splits: 7
12/04/13 13:39:19 WARN bsp.BSPJobClient: No job jar file set. User classes may
not be found. See BSPJob#setJar(String) or check Your jar file.
12/04/13 13:39:20 INFO bsp.BSPJobClient: Running job: job_localrunner_0001
12/04/13 13:39:22 INFO bsp.LocalBSPRunner: Setting up a new barrier for 7 tasks!
local:6 -> 176
local:2 -> 133
local:5 -> 189
local:0 -> 113
local:3 -> 92
local:4 -> 29
local:1 -> 78
12/04/13 13:39:23 INFO bsp.BSPJobClient: Current supersteps number: 1
12/04/13 13:39:23 INFO bsp.BSPJobClient: The total number of supersteps: 1
from file:/tmp/hama-sampling-out/part-00000
-2145373038 -2135777393 -2127418941 -2127349118 -2116694526 -2112753401
-2111019858 -2109843938 -2109467658
-1775154178 -1771096268 -1768609402 -1767599475 -1753155542 -1744884630
-1736545907 -1734220768 -1727656934
-1727161209 -1724429198 -1712603905 -1711206669 -1693536736
from file:/tmp/hama-sampling-out/part-00001
-1684778946 -1683715271 -1677988183 -1673772158 -1672941153 -1669199897
-1661791404 -1660526886 -1658572801
-1579967204 -1577470192 -1569276585
<rest omitted>
{noformat}
However, I very much doubt that the algorithm is faster than MapReduce. I think
we can use the Quicksort class in Hadoop to optimize further. I used Java 7's
new Timsort via Arrays.sort() because it sorts in place, but to get there I
have a huge collections overhead and RAM usage.
But the idea of the algorithm is very cool.
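For readers who have not seen the idea before, the sampling and pivoting step discussed above can be sketched roughly as follows. This is a single-process illustration under my own assumptions, not the linked SamplingSort.java: the class name SampleSortSketch, its method names, and the sampleSize and seed parameters are all hypothetical, and no Hama API is used. It works on primitive int arrays to sidestep the collections overhead mentioned above.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical single-process sketch of sample sort; not the linked SamplingSort.java.
class SampleSortSketch {

    // Draws a random sample (with replacement), sorts it, and picks
    // numTasks - 1 evenly spaced splitters (pivots) from the sorted sample.
    static int[] chooseSplitters(int[] data, int numTasks, int sampleSize, long seed) {
        Random rnd = new Random(seed);
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = data[rnd.nextInt(data.length)];
        }
        Arrays.sort(sample);
        int[] splitters = new int[numTasks - 1];
        for (int i = 1; i < numTasks; i++) {
            splitters[i - 1] = sample[i * sampleSize / numTasks];
        }
        return splitters;
    }

    // Maps a value to the index of the task whose key range contains it.
    static int targetTask(int value, int[] splitters) {
        int pos = Arrays.binarySearch(splitters, value);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Partitions the data by splitter, then sorts each partition on its own.
    static int[] sort(int[] data, int numTasks, int sampleSize, long seed) {
        if (data.length == 0) {
            return new int[0];
        }
        int[] splitters = chooseSplitters(data, numTasks, sampleSize, seed);

        // First pass: count how many values each task would receive.
        int[] counts = new int[numTasks];
        for (int v : data) {
            counts[targetTask(v, splitters)]++;
        }

        // Prefix sums give each task a contiguous range in the output array.
        int[] start = new int[numTasks + 1];
        for (int t = 0; t < numTasks; t++) {
            start[t + 1] = start[t] + counts[t];
        }

        // Second pass: scatter every value into its task's range.
        int[] next = Arrays.copyOf(start, numTasks);
        int[] out = new int[data.length];
        for (int v : data) {
            out[next[targetTask(v, splitters)]++] = v;
        }

        // Each "task" sorts only its own range; the ranges are already
        // ordered relative to each other, so the result is globally sorted.
        for (int t = 0; t < numTasks; t++) {
            Arrays.sort(out, start[t], start[t + 1]);
        }
        return out;
    }
}
```

A call such as SampleSortSketch.sort(values, 7, 70, 42L) would mimic the seven tasks in the log. The counts array inside sort() corresponds to the per-task sizes; a small or unlucky sample makes those sizes uneven, which is the naive-pivoting effect described above.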
> Create distributed sort BSP
> ---------------------------
>
> Key: HAMA-540
> URL: https://issues.apache.org/jira/browse/HAMA-540
> Project: Hama
> Issue Type: New Feature
> Components: bsp, examples
> Reporter: Thomas Jungblut
>
> For HAMA-535 we need some kind of sort framework; it could be useful for
> various other tasks as well.