[ https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-4902:
-----------------------------
    Target Version/s:   (was: 1.2.0, 1.3.0)

> gap-sampling performance optimization
> -------------------------------------
>
>                 Key: SPARK-4902
>                 URL: https://issues.apache.org/jira/browse/SPARK-4902
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Guoqiang Li
>
> {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator
> that contains an array or an iterator (when there is not enough memory).
> The GapSamplingIterator implementation is as follows:
> {code}
> private val iterDrop: Int => Unit = {
>   val arrayClass = Array.empty[T].iterator.getClass
>   val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
>   data.getClass match {
>     // Array-backed iterators: delegate to Iterator.drop
>     case `arrayClass` => ((n: Int) => { data = data.drop(n) })
>     case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
>     // Any other iterator: advance one element at a time
>     case _ => ((n: Int) => {
>       var j = 0
>       while (j < n && data.hasNext) {
>         data.next()
>         j += 1
>       }
>     })
>   }
> }
> {code}
> The code does not handle InterruptibleIterator: the match is on
> {{data.getClass}}, so a wrapped array iterator falls through to the default case.
> As a result, the following code cannot use the {{Iterator.drop}} method:
> {code}
> rdd.cache()
> rdd.sample(false, 0.1)
> {code}
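
To illustrate the problem described in the issue, here is a minimal sketch in plain Scala with no Spark dependency. `WrapperIterator` is a hypothetical stand-in for a wrapper such as InterruptibleIterator, and `DropDispatch.unwrap` is an assumed helper, not the fix adopted in Spark. It shows that a class-based match no longer recognizes an array iterator once it is wrapped, and that peeling off the wrapper makes it recognizable again.

```scala
// Hypothetical stand-in for a wrapper such as InterruptibleIterator:
// it only forwards hasNext/next to the iterator it wraps.
class WrapperIterator[T](val delegate: Iterator[T]) extends Iterator[T] {
  def hasNext: Boolean = delegate.hasNext
  def next(): T = delegate.next()
}

object DropDispatch {
  // Assumed helper: strip any number of wrapper layers and
  // return the innermost iterator.
  @annotation.tailrec
  def unwrap[T](it: Iterator[T]): Iterator[T] = it match {
    case w: WrapperIterator[T @unchecked] => unwrap(w.delegate)
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // Reference class, obtained the same way as `arrayClass` in the issue
    // (from a non-empty Int array here to keep the comparison like-for-like).
    val arrayIterClass = Array(0).iterator.getClass

    val wrapped = new WrapperIterator(Array(1, 2, 3, 4, 5).iterator)

    // The wrapper has its own class, so a match on getClass misses it.
    println(wrapped.getClass == arrayIterClass)

    // After unwrapping, the underlying array iterator is visible again.
    val inner = unwrap(wrapped)
    println(inner.getClass == arrayIterClass)

    // Iterator.drop can now be applied to the underlying iterator.
    println(inner.drop(2).next())
  }
}
```

One caveat with this approach: dropping on the unwrapped iterator bypasses whatever the wrapper does on each call, so a real change would have to decide whether that behaviour must be preserved across the skipped elements.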