Hi there,
    I was finally able to identify the bug: the StreamBuffer.compareTo method has ill-defined behavior when a key's hashCode equals Int.MaxValue. Though this occurs with only about a 1/2^32 chance per key, it can happen frequently once the number of keys approaches 2^32. I have created a pull request with the bug fix: https://github.com/apache/incubator-spark/pull/612
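For anyone curious about the failure mode, here is a minimal sketch. It assumes the pre-fix code used Int.MaxValue as a sentinel minimum hash for exhausted stream buffers, so a live key that happens to hash to Int.MaxValue becomes indistinguishable from an empty buffer; all names below are illustrative, not the actual Spark source.

    import scala.collection.mutable.ArrayBuffer

    // Illustrative reconstruction only, not the exact Spark code.
    // Assumption: an exhausted buffer reports Int.MaxValue as its
    // minimum key hash so that empty buffers sort last in the merge.
    class StreamBufferSketch(val pairs: ArrayBuffer[(Int, Int)])
        extends Comparable[StreamBufferSketch] {

      // Ill-defined case: a live key whose hashCode is Int.MaxValue
      // reports the same minKeyHash as an exhausted buffer.
      def minKeyHash: Int =
        if (pairs.nonEmpty) pairs.head._1.hashCode() else Int.MaxValue

      override def compareTo(other: StreamBufferSketch): Int =
        Integer.compare(minKeyHash, other.minKeyHash)
    }

    object SentinelDemo extends App {
      val live  = new StreamBufferSketch(ArrayBuffer(Int.MaxValue -> 1))
      val empty = new StreamBufferSketch(ArrayBuffer.empty)
      // The comparator sees the two as equal, so merge logic that treats
      // minKeyHash == Int.MaxValue as "no more pairs" may call next() on
      // an exhausted iterator, raising the NoSuchElementException seen
      // in the thread below.
      println(live.compareTo(empty)) // prints 0
    }

A more robust design tracks emptiness explicitly (an isEmpty check plus an assertion in minKeyHash) rather than overloading a legal hash value as a sentinel. Patrick's earlier workaround of disabling the external sorting, quoted below, is sketched after the thread.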
Best Regards,
Jiacheng Guo

On Mon, Jan 27, 2014 at 2:36 PM, guojc <guoj...@gmail.com> wrote:
> Hi Patrick,
>     I have created the jira
> https://spark-project.atlassian.net/browse/SPARK-1045. It turns out the
> situation is related to joining two large RDDs, not to the combine
> process as previously thought.
>
> Best Regards,
> Jiacheng Guo
>
>
> On Mon, Jan 27, 2014 at 11:07 AM, guojc <guoj...@gmail.com> wrote:
>
>> Hi Patrick,
>>     I think this might be data related and about edge-condition handling,
>> as only a single partition repeatedly throws the exception on
>> ExternalAppendOnlyMap's iterator. I will file a jira as soon as I can
>> isolate the problem. Btw, the test intentionally abuses the external sort
>> to see its performance impact on a real application, because I have
>> trouble configuring the right partition number for each dataset.
>>
>> Best Regards,
>> Jiacheng Guo
>>
>>
>> On Mon, Jan 27, 2014 at 6:16 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>>
>>> Hey There,
>>>
>>> So one thing you can do is disable the external sorting; this should
>>> preserve the behavior exactly as it was in previous releases.
>>>
>>> It's quite possible that the problem you are having relates to the
>>> fact that you have individual records that are 1GB in size. This is a
>>> pretty extreme case that may violate assumptions in the implementation
>>> of the external aggregation code.
>>>
>>> Would you mind opening a Jira for this? Also, if you are able to find
>>> an isolated way to recreate the behavior, it will make it easier to
>>> debug and fix.
>>>
>>> IIRC, even with external aggregation Spark still materializes the
>>> final combined output *for a given key* in memory. If you are
>>> outputting gigabytes of data for a single key, then you might also look
>>> into a different parallelization strategy for your algorithm. Not sure
>>> if this is also an issue though...
>>>
>>> - Patrick
>>>
>>> On Sun, Jan 26, 2014 at 2:27 AM, guojc <guoj...@gmail.com> wrote:
>>> > Hi Patrick,
>>> >     I still get the exception on the latest master
>>> > (05be7047744c88e64e7e6bd973f9bcfacd00da5f). A bit more info on the
>>> > subject: I'm using Kryo serialization with a custom serialization
>>> > function, and the exception comes from the RDD operation
>>> > combineByKey(createDict, combineKey, mergeDict, partitioner, true,
>>> > "org.apache.spark.serializer.KryoSerializer").
>>> > All previous operations seem OK. The only difference is that this
>>> > operation can generate a large dict object of around 1 GB in size. I
>>> > hope this gives you some clue about what might go wrong. I'm still
>>> > having trouble figuring out the cause.
>>> >
>>> > Thanks,
>>> > Jiacheng Guo
>>> >
>>> >
>>> > On Wed, Jan 22, 2014 at 1:36 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>> >>
>>> >> This code has been modified since you reported this, so you may want
>>> >> to try the current master.
>>> >>
>>> >> - Patrick
>>> >>
>>> >> On Mon, Jan 20, 2014 at 4:22 AM, guojc <guoj...@gmail.com> wrote:
>>> >> > Hi,
>>> >> >     I'm trying out the latest master branch of Spark for the
>>> >> > exciting external hashmap feature. I have code that runs correctly
>>> >> > on Spark 0.8.1, and I only made a change so that it spills to disk
>>> >> > more easily. However, I encounter a few task failures with
>>> >> > java.util.NoSuchElementException:
>>> >> >
>>> >> > org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:277)
>>> >> > org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:212)
>>> >> > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:29)
>>> >> >
>>> >> > And the job seems unable to recover. Can anyone give me some
>>> >> > suggestions on how to investigate the issue?
>>> >> >
>>> >> > Thanks,
>>> >> > Jiacheng Guo
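As a footnote to Patrick's suggestion above: in the 0.9-era configuration, disabling the external sorting corresponds to turning off the spark.shuffle.spill property. A minimal sketch, with the master URL and app name as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Turning off shuffle spilling bypasses ExternalAppendOnlyMap and
    // restores the in-memory aggregation behavior of earlier releases,
    // at the cost of OOM risk for very large aggregations.
    val conf = new SparkConf()
      .setMaster("local[4]")               // placeholder
      .setAppName("SpillRepro")            // placeholder
      .set("spark.shuffle.spill", "false") // default is "true"
    val sc = new SparkContext(conf)

    // The failing call in the thread had this shape (user functions elided):
    // rdd.combineByKey(createDict, combineKey, mergeDict, partitioner,
    //                  true, "org.apache.spark.serializer.KryoSerializer")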