Hi,
Thanks a lot for your reply.
It seems that it is caused by the slowness of the second code.
I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])).
The program now runs normally.
By the way, I find that while the computation is running, the UI still shows
scheduler delay. However,
It seems you want to dedupe your data after the merge, so set(a+b) should
also work.. you may ditch the list comprehension operation.
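A small sketch of the two dedup approaches discussed above (not the poster's actual code): set(a + b) works directly only when the list elements are hashable; dicts are not, which is presumably why the items-based workaround was needed. One common hashable projection of a dict is a frozenset of its items (this assumes the dict values are themselves hashable).

```python
# Deduping after merging two lists, for hashable elements:
a = [1, 2, 3]
b = [2, 3, 4]
merged = list(set(a + b))  # order is not guaranteed

# Dicts are unhashable, so set(da + db) would raise TypeError.
# Dedupe via a hashable projection instead, e.g. frozenset of items:
da = [{"x": 1}, {"x": 2}]
db = [{"x": 2}]
seen = {frozenset(d.items()) for d in da + db}
deduped = [dict(fs) for fs in seen]  # [{"x": 1}, {"x": 2}] in some order
```

Whether this is faster than the original comprehension depends on the data; the point is only that the dedup key must be hashable.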
On 5 Aug 2015 23:55, gen tang gen.tan...@gmail.com wrote:
On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
Recently, I met some problems with scheduler delay in pyspark. I worked on
this problem for several days without success, so I have come here to ask
for help.
I have a key-value pair RDD like rdd[(key, list[dict])] and I tried to
merge the values by adding the two lists.
If I do reduceByKey as
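The message is truncated here, but merging list values by concatenation in PySpark would typically be written as rdd.reduceByKey(lambda a, b: a + b); the exact lambda is an assumption. Below is a plain-Python simulation of that semantics on a tiny dataset, so it can run without a Spark cluster:

```python
from collections import defaultdict

# Simulated pair RDD: (key, list[dict]) records, as described in the post.
pairs = [
    ("k1", [{"x": 1}]),
    ("k1", [{"x": 2}]),
    ("k2", [{"x": 3}]),
]

# reduceByKey(lambda a, b: a + b) applies the function pairwise per key,
# which for lists is just concatenation of all values under that key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key] = grouped[key] + value

merged = dict(grouped)  # {"k1": [{"x": 1}, {"x": 2}], "k2": [{"x": 3}]}
```

Note that this merge keeps duplicates; the dedup step discussed earlier in the thread happens afterwards.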