GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/1722

    SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter

    All these changes are from @mridulm's work in #1609, but extracted here to
    fix this specific issue and make it easier to merge into 1.1. This particular
    set of changes makes sure that we read exactly the right range of bytes
    from each spill file in ExternalAppendOnlyMap (EAOM): some serializers can
    write bytes after the last object (e.g. the TC_RESET flag in Java
    serialization), and the previous code would mistakenly read those bytes as
    part of the next batch. There are also improvements to cleanup to make sure
    files are closed.
    
    In addition to bringing in the changes to ExternalAppendOnlyMap, I also
    copied them to the corresponding code in ExternalSorter and updated its test
    suite to test for the same issues.
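
The byte-range issue described above can be illustrated with a minimal, self-contained sketch. It is not the code from this pull request: the class and method names (BatchedSpillSketch, writeBatches, readBatches) are hypothetical, an in-memory byte array stands in for a spill file, and plain Java serialization is assumed. The idea is to record each batch's byte length at write time, trailing TC_RESET included, and to give the reader a stream bounded to exactly that range.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative, not Spark's.
class BatchedSpillSketch {

    // Writes every batch through its own ObjectOutputStream into one in-memory
    // "spill file" and appends the byte length of each batch to batchSizes.
    static byte[] writeBatches(List<List<String>> batches, List<Integer> batchSizes)
            throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        for (List<String> batch : batches) {
            int start = file.size();
            ObjectOutputStream out = new ObjectOutputStream(file);
            for (String item : batch) {
                out.writeObject(item);
            }
            out.reset();  // writes a TC_RESET byte after the last object
            out.flush();
            // The recorded length covers the stream header, the objects,
            // and the trailing TC_RESET byte.
            batchSizes.add(file.size() - start);
        }
        return file.toByteArray();
    }

    // Reads each batch from a stream bounded to exactly its recorded byte range,
    // so trailing bytes of one batch can never leak into the next one.
    static List<List<String>> readBatches(byte[] file, List<Integer> batchSizes)
            throws IOException, ClassNotFoundException {
        List<List<String>> batches = new ArrayList<>();
        int offset = 0;
        for (int size : batchSizes) {
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(file, offset, size));
            List<String> batch = new ArrayList<>();
            try {
                while (true) {
                    batch.add((String) in.readObject());
                }
            } catch (EOFException endOfBatch) {
                // The bounded range is exhausted: this batch is complete.
            }
            batches.add(batch);
            offset += size;
        }
        return batches;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> batches =
            Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        List<Integer> sizes = new ArrayList<>();
        byte[] file = writeBatches(batches, sizes);
        System.out.println(readBatches(file, sizes).equals(batches));
    }
}
```

A reader that instead kept consuming one unbounded stream and counted objects would leave the TC_RESET byte unread at the end of a batch, so the next batch's stream would start at the wrong offset. Bounding each read by the recorded length avoids having to know what a given serializer appends after the last object.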

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark spark-2792

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1722.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1722
    
----
commit 9a78e4b2fdf6ca20667aca478e39ad7fa5a34e11
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-01T22:02:13Z

    Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
    
    All these changes are from @mridulm's work in #1609, but extracted here
    to fix this specific issue. This particular set of changes is to make
    sure that we read exactly the right range of bytes from each spill file
    in EAOM: some serializers can write bytes after the last object (e.g.
    the TC_RESET flag in Java serialization) and that would confuse the
    previous code into reading it as part of the next batch. There are also
    improvements to cleanup to make sure files are closed.

commit 0d6dad7dc0f38cc7accb89b75c848a2b31fe254c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-01T22:12:28Z

    Added Mridul's test changes for ExternalAppendOnlyMap

commit bda37bb431d44c11c097497fd389d6ab2b97c69c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-08-01T22:38:01Z

    Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
    
    Modified ExternalSorterSuite to also set a low object stream reset and
    batch size, and verified that it failed before the changes and succeeded
    after.

----

