GitHub user mizitch opened a pull request: https://github.com/apache/incubator-beam/pull/1327
[BEAM-840] Some minor changes and fixes for sorter module.

Be sure to do all of the following to help us incorporate your contribution quickly and easily:

- [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
- [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes).
- [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
- [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

---

Includes:

* Limit max memory for ExternalSorter and BufferedExternalSorter to 2047 MB to prevent int overflow within Hadoop's sorting library
* Fix int overflow for large memory values in InMemorySorter
* Add note about estimated disk use to README.MD
* Fix to make Hadoop's sorting library put all temp files under the specified directory
* Have Hadoop clean up the temp directory on exit
* Stop shading Hadoop dependencies. Some context:
  * The existing shading is broken (modules that depend on this one cannot use it successfully).
  * Hadoop uses reflection in several places, which makes shading the dependency "in a good way" nearly impossible. It requires a couple of rather brittle hacks, and for clients that depend on certain conflicting versions of Hadoop, these hacks can mean the shading fails at its intended goal of preventing conflicts anyway.
  * From what I can tell, there's no good way to shade this to make it universally usable, so leaving it unshaded seems like a reasonable default.
  * Without shading Hadoop, this module can be used successfully from Beam's wordcount example (which already has Hadoop dependencies of its own).
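For reviewers wondering where the 2047 MB figure comes from: assuming the overflow arises from converting a megabyte count to bytes in `int` arithmetic, 2048 MB is exactly 2^31 bytes, one more than `Integer.MAX_VALUE`, while 2047 MB still fits. The sketch below only illustrates that arithmetic. The class and method names are hypothetical and are not taken from the sorter module or from Hadoop.

```java
// Illustrative sketch only: names here are hypothetical, not the sorter module's API.
public class MemoryLimitSketch {
  // Largest megabyte count whose byte value still fits in a signed 32-bit int.
  static final int MAX_MEMORY_MB = 2047;

  // Naive conversion: the multiplication is done in int and silently wraps.
  static int unsafeMbToBytes(int mb) {
    return mb * 1024 * 1024;
  }

  // Widening to long before multiplying avoids the wrap-around.
  static long safeMbToBytes(int mb) {
    return mb * 1024L * 1024L;
  }

  // Rejects values that would overflow if later converted to bytes as an int.
  static int checkedMemoryMb(int mb) {
    if (mb <= 0 || mb > MAX_MEMORY_MB) {
      throw new IllegalArgumentException(
          "memoryMB must be in (0, " + MAX_MEMORY_MB + "], got " + mb);
    }
    return mb;
  }

  public static void main(String[] args) {
    System.out.println(unsafeMbToBytes(2047)); // 2146435072
    System.out.println(unsafeMbToBytes(2048)); // -2147483648 (wrapped)
    System.out.println(safeMbToBytes(2048));   // 2147483648
  }
}
```

Validating the setting up front, as `checkedMemoryMb` does here, fails fast with a clear message instead of letting a negative buffer size surface later inside a dependency.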
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mizitch/incubator-beam sorter-gcs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-beam/pull/1327.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1327

----
commit d07c4ce9349abac4d0c53223072f1c84a1dc98c6
Author: Mitch Shanklin <mshank...@google.com>
Date:   2016-11-09T22:09:49Z

    Some minor changes and fixes for sorter module.

    Includes:
    * Limit max memory for ExternalSorter and BufferedExternalSorter to 2047 MB to prevent int overflow within Hadoop's sorting library
    * Fix int overflow for large memory values in InMemorySorter
    * Add note about estimated disk use to README.MD
    * Fix to make Hadoop's sorting library put all temp files under the specified directory
    * Have Hadoop clean up the temp directory on exit
    * Stop shading hadoop dependencies. Some context:
    ** The existing shading is broken (modules that depend on this one cannot use it successfully).
    ** Hadoop's use of reflection in several instances makes shading the dependency "in a good way" nearly impossible. It requires a couple of rather brittle hacks, and, for clients that depend on certain conflicting versions of hadoop these hacks can mean it doesn't meet its intended goal of preventing conflicts anyway.
    ** From what I can tell, there's no good way to shade this to make it universally usable, so leaving it unshaded seems like a reasonable default.
    ** Without shading Hadoop, this module can be successfully used from Beam's wordcount example (which actually does have pre-existing hadoop dependencies already).

----

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.