GitHub user yanbohappy opened a pull request:

    https://github.com/apache/spark/pull/3827

    HiveTableScan return mutable row with copy

    https://issues.apache.org/jira/browse/SPARK-4963
    SchemaRDD.sample() returns wrong results because GapSamplingIterator
    operates on mutable rows.
    HiveTableScan produces an RDD of SpecificMutableRow, and
    SchemaRDD.sample() returns a GapSamplingIterator to iterate over it.
    
    override def next(): T = {
      val r = data.next()
      advance
      r
    }
    
    GapSamplingIterator.next() takes the current element from the underlying
    iterator and assigns it to r.
    However, if the underlying iterator reuses a mutable row, as the one
    returned by HiveTableScan does, every call to data.next() yields the same
    object, so r points to that shared row.
    The advance operation then drops some underlying elements, and each
    dropped element overwrites the shared row and therefore changes r, which
    is not expected. As a result, next() returns a value different from the
    initial r.
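
    The aliasing described above can be reproduced outside Spark with a
    minimal sketch. MutableRow, ReusingIterator and FixedGapIterator are
    hypothetical stand-ins written for this illustration, not Spark classes,
    and the sampler skips a fixed number of elements instead of a random gap
    so that the result is deterministic.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark classes.
final class MutableRow(var value: Int)

// Mimics a scan that reuses one row object for every record it produces.
final class ReusingIterator(values: Seq[Int]) extends Iterator[MutableRow] {
  private val underlying = values.iterator
  private val row = new MutableRow(0)
  override def hasNext: Boolean = underlying.hasNext
  override def next(): MutableRow = {
    row.value = underlying.next() // overwrite the shared row in place
    row
  }
}

// Simplified sampler: returns one element, then skips `gap` elements.
// (Assumption: a fixed gap replaces the random gap of the real sampler.)
final class FixedGapIterator[T](data: Iterator[T], gap: Int) extends Iterator[T] {
  private def advance(): Unit = {
    var i = 0
    while (i < gap && data.hasNext) {
      data.next() // the dropped element still mutates the shared row
      i += 1
    }
  }
  override def hasNext: Boolean = data.hasNext
  override def next(): T = {
    val r = data.next()
    advance()
    r // same object as the shared row, now holding a later value
  }
}

object AliasingDemo {
  // Samples every (gap + 1)-th element and reads each row's value right away.
  def sampleShared(values: Seq[Int], gap: Int): List[Int] =
    new FixedGapIterator(new ReusingIterator(values), gap).map(_.value).toList
}
```

    With the input 1 to 6 and a gap of 1, the expected sample is 1, 3, 5, but
    AliasingDemo.sampleShared(1 to 6, 1) yields List(2, 4, 6), because each
    returned row has already been overwritten by the element that advance
    dropped.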
    
    To fix this issue, the most direct way is to make HiveTableScan return a
    copy of each mutable row, as the initial commit in this pull request does.
    With this solution HiveTableScan cannot take full advantage of the
    reusable MutableRow, but the sample operation returns correct results.
    Furthermore, we need to investigate GapSamplingIterator.next() and whether
    it can perform the copy itself. To achieve this, every element type that
    an RDD can store would have to implement something like a clone function,
    which would be a huge change.
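
    The two options above can be compared in a small sketch. Everything here
    is hypothetical and written only for illustration: MutableRow, the Copier
    type class, CopyingGapIterator and the fixed gap are assumptions, not
    Spark APIs and not the code in this pull request.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark classes.
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

// Hypothetical type class describing how to snapshot an element.
trait Copier[T] { def copy(t: T): T }

object Copier {
  // Rows are mutable, so a snapshot needs a real copy.
  implicit val rowCopier: Copier[MutableRow] = new Copier[MutableRow] {
    def copy(t: MutableRow): MutableRow = t.copy()
  }
  // Explicit opt-out for elements that are already safe to share.
  def noCopy[T]: Copier[T] = new Copier[T] { def copy(t: T): T = t }
}

// Simplified sampler with a fixed gap that snapshots the element
// before it advances the underlying iterator.
final class CopyingGapIterator[T](data: Iterator[T], gap: Int)(implicit c: Copier[T])
    extends Iterator[T] {
  override def hasNext: Boolean = data.hasNext
  override def next(): T = {
    val r = c.copy(data.next()) // take the snapshot first
    var i = 0
    while (i < gap && data.hasNext) { data.next(); i += 1 }
    r
  }
}

object CopyRemedies {
  // Scan that reuses a single row object for every record.
  def reusingScan(values: Seq[Int]): Iterator[MutableRow] = {
    val row = new MutableRow(0)
    values.iterator.map { v =>
      row.value = v
      row
    }
  }

  // Option 1: the scan hands out copies, so the sampler does not copy.
  def copyAtSource(values: Seq[Int], gap: Int): List[Int] = {
    val copied = reusingScan(values).map(_.copy())
    new CopyingGapIterator(copied, gap)(Copier.noCopy[MutableRow]).map(_.value).toList
  }

  // Option 2: the scan keeps reusing its row and the sampler copies.
  def copyInSampler(values: Seq[Int], gap: Int): List[Int] =
    new CopyingGapIterator(reusingScan(values), gap).map(_.value).toList
}
```

    Both variants return List(1, 3, 5) for the input 1 to 6 with a gap of 1.
    In this sketch the first allocates a copy for every scanned row, including
    the dropped ones, while the second copies only the rows that are returned
    but requires a Copier instance for every element type.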


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanbohappy/spark spark-4963

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3827.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3827
    
----
commit 6eaee5e7b1b5aca7f6abd16892f8312c7d6d7917
Author: Yanbo Liang <yanboha...@gmail.com>
Date:   2014-12-29T09:00:44Z

    HiveTableScan return mutable row with copy

----

