GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/3827
HiveTableScan return mutable row with copy

https://issues.apache.org/jira/browse/SPARK-4963

SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows.

HiveTableScan builds its RDD from a reused SpecificMutableRow, and SchemaRDD.sample() iterates over that RDD with a GapSamplingIterator:

    override def next(): T = {
      val r = data.next()
      advance
      r
    }

GapSamplingIterator.next() takes the current underlying element and assigns it to r. However, if the underlying iterator yields a mutable row, as the one from HiveTableScan does, then r and the iterator's current row are the same object. The advance call skips some underlying elements, which overwrites that shared row and therefore changes r as well. The value we return is then different from the one r held initially.

The most direct fix is to have HiveTableScan return a copy of each mutable row, which is what my initial commit does. With this change HiveTableScan can no longer take full advantage of the reusable MutableRow, but the sample operation returns correct results.

Furthermore, we should investigate whether GapSamplingIterator.next() could perform the copy itself. To achieve this, every element type that an RDD can store would have to implement something like Cloneable, which would be a huge change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanbohappy/spark spark-4963

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3827.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3827

----
commit 6eaee5e7b1b5aca7f6abd16892f8312c7d6d7917
Author: Yanbo Liang <yanboha...@gmail.com>
Date:   2014-12-29T09:00:44Z

    HiveTableScan return mutable row with copy

----
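The aliasing problem described in this pull request can be reproduced without Spark. The following is only a minimal sketch under stated assumptions: MutableRow, ReusingIterator, FixedGapIterator and SampleAliasingDemo are hypothetical stand-ins written for illustration, not Spark's SpecificMutableRow, HiveTableScan or GapSamplingIterator, and the fixed gap of 1 replaces the random gap a real sampler would draw.

```scala
// Hypothetical stand-in for a reusable mutable row (not Spark's SpecificMutableRow).
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

// Mimics a scan that writes every record into one shared row object.
class ReusingIterator(values: Seq[Int]) extends Iterator[MutableRow] {
  private val shared = new MutableRow(0)
  private val underlying = values.iterator
  override def hasNext: Boolean = underlying.hasNext
  override def next(): MutableRow = {
    shared.value = underlying.next() // overwrites the previously returned row
    shared
  }
}

// Simplified sampler: return one element, then skip `gap` elements.
// next() has the same shape as the snippet quoted in the description.
class FixedGapIterator[T](data: Iterator[T], gap: Int) extends Iterator[T] {
  override def hasNext: Boolean = data.hasNext
  private def advance(): Unit = {
    var i = 0
    while (i < gap && data.hasNext) { data.next(); i += 1 }
  }
  override def next(): T = {
    val r = data.next()
    advance() // with a shared row, this also changes what r refers to
    r
  }
}

object SampleAliasingDemo {
  // Keep every other row and read out its value.
  def sample(rows: Iterator[MutableRow]): List[Int] =
    new FixedGapIterator(rows, gap = 1).map(_.value).toList

  def main(args: Array[String]): Unit = {
    val input = Seq(1, 2, 3, 4, 5, 6)
    // Shared row: each sampled row has been overwritten by the skipped one.
    println(s"reused rows: ${sample(new ReusingIterator(input))}")
    // Copy each row before it reaches the sampler, as the proposed fix does.
    println(s"copied rows: ${sample(new ReusingIterator(input).map(_.copy()))}")
  }
}
```

With the shared row the sketch yields List(2, 4, 6); copying each row first yields the intended List(1, 3, 5). The copy costs one allocation per row, which is the trade-off the description mentions.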