GitHub user mateiz opened a pull request: https://github.com/apache/spark/pull/1722

SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter

All these changes are from @mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and the previous code would mistakenly read those bytes as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark spark-2792

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1722.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1722

----
commit 9a78e4b2fdf6ca20667aca478e39ad7fa5a34e11
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-01T22:02:13Z

    Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes

    All these changes are from @mridulm's work in #1609, but extracted here to fix this specific issue. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and the previous code would mistakenly read those bytes as part of the next batch. There are also improvements to cleanup to make sure files are closed.
commit 0d6dad7dc0f38cc7accb89b75c848a2b31fe254c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-01T22:12:28Z

    Added Mridul's test changes for ExternalAppendOnlyMap

commit bda37bb431d44c11c097497fd389d6ab2b97c69c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-01T22:38:01Z

    Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too

    Modified ExternalSorterSuite to also set a low object stream reset and batch size, and verified that it failed before the changes and succeeded after.
----
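
To illustrate the batch-boundary problem described in this pull request, here is a minimal Java sketch. It is not Spark's code: `SpillSketch` and its methods are hypothetical names, and an in-memory byte array stands in for a spill file. Each batch is written with its own `ObjectOutputStream`, and the `reset()` call leaves a TC_RESET byte after the last object. Reading the batches back to back by object count leaves that byte unconsumed, so it is taken as the start of the next batch. Reading each batch from its recorded byte range avoids the problem.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: an in-memory stand-in for a spill file.
// All names here are hypothetical and do not come from Spark.
class SpillSketch {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();
    private final List<Integer> batchBytes = new ArrayList<>(); // exact byte length of each batch
    private final List<Integer> batchItems = new ArrayList<>(); // number of objects in each batch

    // Append one batch using a fresh ObjectOutputStream and record its exact size.
    void writeBatch(List<String> items) throws IOException {
        int start = file.size();
        ObjectOutputStream out = new ObjectOutputStream(file);
        for (String item : items) {
            out.writeObject(item);
        }
        out.reset(); // writes a TC_RESET byte after the last object
        out.flush();
        batchBytes.add(file.size() - start);
        batchItems.add(items.size());
    }

    // Naive approach: share one input stream and rely on object counts alone.
    // The trailing TC_RESET byte of a batch is left unread and corrupts the next batch.
    List<List<String>> readSequentially() throws IOException, ClassNotFoundException {
        InputStream in = new ByteArrayInputStream(file.toByteArray());
        List<List<String>> result = new ArrayList<>();
        for (int count : batchItems) {
            result.add(readObjects(in, count));
        }
        return result;
    }

    // Bounded approach: give each batch a stream over exactly its own byte range.
    List<List<String>> readByByteRange() throws IOException, ClassNotFoundException {
        byte[] bytes = file.toByteArray();
        List<List<String>> result = new ArrayList<>();
        int offset = 0;
        for (int i = 0; i < batchBytes.size(); i++) {
            int length = batchBytes.get(i);
            InputStream slice = new ByteArrayInputStream(bytes, offset, length);
            result.add(readObjects(slice, batchItems.get(i)));
            offset += length; // the next batch starts exactly where this one ended
        }
        return result;
    }

    // Open a new ObjectInputStream (which reads a stream header) and read `count` objects.
    private static List<String> readObjects(InputStream in, int count)
            throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(in);
        List<String> items = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            items.add((String) ois.readObject());
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        SpillSketch spill = new SpillSketch();
        spill.writeBatch(Arrays.asList("a", "b"));
        spill.writeBatch(Arrays.asList("c"));
        System.out.println("by byte range: " + spill.readByByteRange());
        try {
            System.out.println("sequential: " + spill.readSequentially());
        } catch (IOException e) {
            System.out.println("sequential read failed: " + e);
        }
    }
}
```

With a real file, the same idea would mean tracking the starting offset of every batch and positioning the reader at that offset before opening the next deserialization stream, rather than trusting that the previous stream consumed all of its bytes.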