Re: Non-deterministic behavior in spark

Ognen Duzlevski Fri, 24 Jan 2014 05:40:32 -0800

Thanks.

This is a VERY simple example.


I have two 20 GB JSON files. Each line in both files has the same format.
On the first file I run: val events = filter(_.split(something)(get the
field)).map(field => (field, 0))
On the second file I run the same filter as val events1 and then do
map(field => (field, 1))

This ensures that events has the form (field, 0) and events1 has the form
(field, 1).

I then do val ret = events.union(events1), which puts all the fields in
the same RDD.

Then I do val r = ret.groupByKey().filter(e => e._2.length > 1 &&
e._2(0)==0) to make sure every group for a given field has at least two
elements and the first one is a zero (so, for example, an entry in this
structure will have the form (field, (0, 1, 1, 1, ...))).

I then just do a simple r.count
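
For reference, here is a minimal sketch of the steps above, assuming a recent Spark release where groupByKey returns an Iterable per key. The object name, the comma delimiter, the field position, and the input paths taken from args are all placeholders; the message above does not say which delimiter or position was actually used. keepPositional mirrors the filter as described, and keepAnyZero is an order-independent variant added only for comparison.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SharedFields {
  // Hypothetical extraction: the delimiter and position are assumptions.
  def extractField(line: String, delimiter: String = ",", position: Int = 0): Option[String] = {
    val parts = line.split(delimiter)
    if (position < parts.length) Some(parts(position).trim) else None
  }

  // Read one file and tag every extracted field with the id of its file.
  def tagged(sc: SparkContext, path: String, tag: Int): RDD[(String, Int)] =
    sc.textFile(path)
      .flatMap(line => extractField(line).toList)
      .map(field => (field, tag))

  // Mirrors the filter in the message: more than one element, first is 0.
  def keepPositional(tags: Iterable[Int]): Boolean =
    tags.size > 1 && tags.head == 0

  // Variant that does not depend on the order of values inside a group.
  def keepAnyZero(tags: Iterable[Int]): Boolean =
    tags.size > 1 && tags.exists(_ == 0)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-fields"))
    val events  = tagged(sc, args(0), 0)   // (field, 0) from the first file
    val events1 = tagged(sc, args(1), 1)   // (field, 1) from the second file
    val ret = events.union(events1)        // all fields in one RDD
    val r = ret.groupByKey().filter { case (_, tags) => keepPositional(tags) }
    println(r.count())
    sc.stop()
  }
}
```

One thing worth checking with a sketch like this: the Spark API documentation for groupByKey states that the ordering of elements within each group is not guaranteed and may differ between evaluations. If that applies here, the positional check could accept or reject the same key on different runs, so comparing the count from keepPositional with the count from keepAnyZero may help narrow things down.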

Ognen



On Fri, Jan 24, 2014 at 1:29 PM, 尹绪森 <yinxu...@gmail.com> wrote:

> 1. Is there any in-place operation in your code, such as addi() for
> DoubleMatrix? This kind of operation modifies the original data.
>
> 2. You could try the Spark replay debugger; it has an assert function.
> Hope that helps.
> http://spark-replay-debugger-overview.readthedocs.org/en/latest/
>
>
> 2014/1/24 Ognen Duzlevski <og...@plainvanillagames.com>
>
>> No. It is a filter that splits a line in a json file and extracts a
>> position for it - every run is the same.
>>
>> That's what bothers me about this.
>>
>> Ognen
>>
>>
>> On Fri, Jan 24, 2014 at 12:40 PM, 尹绪森 <yinxu...@gmail.com> wrote:
>>
>>> Is there any non-deterministic code in the filter, such as
>>> Random.nextInt()? If so, the program is no longer idempotent. You
>>> should specify a seed for it.
>>>
>>>
>>> 2014/1/24 Ognen Duzlevski <og...@nengoiksvelzud.com>
>>>
>>>> Hello,
>>>>
>>>> (Sorry for the sensationalist title) :)
>>>>
>>>> If I run Spark on files from S3 and do basic transformation like:
>>>>
>>>> textfile()
>>>> filter
>>>> groupByKey
>>>> count
>>>>
>>>> I get one number (e.g. 40,000).
>>>>
>>>> If I do the same on the same files from HDFS, the number spat out is
>>>> completely different (VERY different - something like 13,000).
>>>>
>>>> What would one do in a situation like this? How do I even go about
>>>> figuring out what the problem is? This is run on a cluster of 15 instances
>>>> on Amazon.
>>>>
>>>> Thanks,
>>>> Ognen
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>> -----------------------------------
>>> Xusen Yin    尹绪森
>>> Beijing Key Laboratory of Intelligent Telecommunications Software and
>>> Multimedia
>>> Beijing University of Posts & Telecommunications
>>> Intel Labs China
>>> Homepage: *http://yinxusen.github.io/ <http://yinxusen.github.io/>*
>>>
>>
>>
>>
>> --
>> "Le secret des grandes fortunes sans cause apparente est un crime
>> oublié, parce qu'il a été proprement fait" - Honore de Balzac
>>
>
>
>
> --
> Best Regards
> -----------------------------------
> Xusen Yin    尹绪森
> Beijing Key Laboratory of Intelligent Telecommunications Software and
> Multimedia
> Beijing University of Posts & Telecommunications
> Intel Labs China
> Homepage: *http://yinxusen.github.io/ <http://yinxusen.github.io/>*
>



-- 
"Le secret des grandes fortunes sans cause apparente est un crime oublié,
parce qu'il a été proprement fait" - Honore de Balzac
