Thanks. This is a VERY simple example.
I have two 20 GB JSON files. Each line in the files has the same format.

On the first file I run:

val events = filter(_split(something)(get the field)).map(field => (field, 0))

On the second file I run the same filter but do .map(field => (field, 1)) and call the result events1. This ensures that events has the form (field, 0) and events1 has the form (field, 1).

I then do val ret = events.union(events1) - this puts all the fields in the same RDD.

Then I do val r = ret.groupByKey().filter(e => e._2.length > 1 && e._2(0) == 0) to keep only the groups whose key field has at least two elements and whose first element is a zero (so, for example, an entry in this structure will have the form (field, (0, 1, 1, 1, ...))).

I then just do a simple r.count.

Ognen

On Fri, Jan 24, 2014 at 1:29 PM, 尹绪森 <yinxu...@gmail.com> wrote:

> 1. Is there any in-place operation in your code, such as addi() for
> DoubleMatrix? That kind of operation modifies the original data.
>
> 2. You could try the Spark replay debugger; it has an assert function.
> Hope that is helpful.
> http://spark-replay-debugger-overview.readthedocs.org/en/latest/
>
>
> 2014/1/24 Ognen Duzlevski <og...@plainvanillagames.com>
>
>> No. It is a filter that splits a line in a JSON file and extracts a
>> position from it - every run is the same.
>>
>> That's what bothers me about this.
>>
>> Ognen
>>
>>
>> On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 <yinxu...@gmail.com> wrote:
>>
>>> Is there any non-deterministic code in the filter, such as
>>> Random.nextInt()? If so, the program loses its idempotence; you
>>> should give it a fixed seed.
>>>
>>>
>>> 2014/1/24 Ognen Duzlevski <og...@nengoiksvelzud.com>
>>>
>>>> Hello,
>>>>
>>>> (Sorry for the sensationalist title) :)
>>>>
>>>> If I run Spark on files from S3 and do basic transformations like:
>>>>
>>>> textfile()
>>>> filter
>>>> groupByKey
>>>> count
>>>>
>>>> I get one number (e.g. 40,000).
>>>>
>>>> If I do the same on the same files from HDFS, the number spat out is
>>>> completely different (VERY different - something like 13,000).
>>>>
>>>> What would one do in a situation like this? How do I even go about
>>>> figuring out what the problem is? This is run on a cluster of 15
>>>> instances on Amazon.
>>>>
>>>> Thanks,
>>>> Ognen
>>>
>>>
>>> --
>>> Best Regards
>>> -----------------------------------
>>> Xusen Yin 尹绪森
>>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>>> Multimedia
>>> Beijing University of Posts & Telecommunications
>>> Intel Labs China
>>> Homepage: http://yinxusen.github.io/

-- 
"Le secret des grandes fortunes sans cause apparente est un crime
oublié, parce qu'il a été proprement fait" - Honore de Balzac
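For reference, the counting pipeline described at the top of the thread can be sketched with plain Scala collections in place of Spark RDDs, so the logic can be checked locally. The field extraction is elided in the thread, so the input sequences below are hypothetical stand-ins for the extracted fields:

```scala
// Hypothetical stand-ins for the fields extracted from the two JSON files.
val file1Fields = Seq("a", "b", "c", "c")
val file2Fields = Seq("b", "c", "d")

// Tag fields from the first file with 0 and from the second with 1.
val events  = file1Fields.map(f => (f, 0))
val events1 = file2Fields.map(f => (f, 1))

// union -> ++ ; groupByKey -> groupBy on the key, keeping only the tags.
val ret = events ++ events1
val grouped = ret.groupBy(_._1).map { case (k, pairs) => (k, pairs.map(_._2)) }

// Keep fields seen at least twice whose first tag is 0, i.e. fields that
// appear in file 1 and at least once more, e.g. (field, Seq(0, 1, 1, ...)).
val r = grouped.filter { case (_, tags) => tags.length > 1 && tags.head == 0 }

println(r.size) // here "b" and "c" qualify, so this prints 2
```

Note that with local collections the group order is deterministic (first-file tags always precede second-file tags), whereas a distributed groupByKey gives no such ordering guarantee, which is relevant when the filter inspects e._2(0).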