Re: large scheduler delay in pyspark

2015-08-05 Thread gen tang
Hi, Thanks a lot for your reply. It seems that it is due to the slowness of the second code. I rewrote the code as list(set([i.items for i in a] + [i.items for i in b])) and the program now behaves normally. By the way, I notice that while the computation is running, the UI still shows scheduler delay. However, …
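As quoted, i.items is presumably i.items(), and since neither dicts nor their item lists are hashable, a set-based dedup along these lines typically converts each dict to a sorted tuple of items first. A minimal sketch of that idea, with illustrative names and assuming the dict values are themselves hashable:

    # Sketch of the set-based dedup described above. Plain dicts are not
    # hashable, so each one is converted to a sorted tuple of items
    # before going into the set, then converted back afterwards.
    def merge_dedupe(a, b):
        """Merge two lists of dicts, dropping exact duplicates."""
        seen = set(tuple(sorted(d.items())) for d in a + b)
        return [dict(items) for items in seen]

    a = [{"id": 1}, {"id": 2}]
    b = [{"id": 2}, {"id": 3}]
    print(merge_dedupe(a, b))  # three unique dicts; order not guaranteed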

Re: large scheduler delay in pyspark

2015-08-05 Thread ayan guha
It seems you want to dedupe your data after the merge, so set(a + b) should also work; you may ditch the list comprehension operation. On 5 Aug 2015 23:55, gen tang gen.tan...@gmail.com wrote: Hi, Thanks a lot for your reply. It seems that it is due to the slowness of the second code. I …
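For reference, a sketch of this simpler form. Note that set(a + b) only works when the list elements are hashable, so this example assumes the records are stored as tuples rather than the thread's plain dicts:

    # ayan guha's simpler form: set(a + b) merges and dedupes in one
    # step, with no list comprehension. This assumes hashable elements;
    # a list of plain dicts would raise TypeError here.
    a = [("id", 1), ("id", 2)]
    b = [("id", 2), ("id", 3)]
    merged = list(set(a + b))
    print(merged)  # three unique tuples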

Re: large scheduler delay in pyspark

2015-08-04 Thread Davies Liu
On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote: Hi, Recently I have run into some problems with scheduler delay in pyspark. I worked on this for several days without success, so I am asking here for help. I have a key-value pair RDD like rdd[(key, list[dict])] …

large scheduler delay in pyspark

2015-08-03 Thread gen tang
Hi, Recently I have run into some problems with scheduler delay in pyspark. I worked on this for several days without success, so I am asking here for help. I have a key-value pair RDD like rdd[(key, list[dict])], and I tried to merge the values by adding the two lists. If I do reduceByKey as …
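The snippet is cut off here. A minimal sketch of the kind of merge the post describes, with hypothetical data and setup, might look like this; the quadratic cost of repeated list concatenation is a plausible source of the slowness discussed later in the thread:

    # A minimal sketch of the merge the post describes: a pair RDD of
    # (key, list[dict]) whose values are combined with reduceByKey.
    # The data and SparkContext setup here are hypothetical.
    from pyspark import SparkContext

    sc = SparkContext(appName="merge-lists-sketch")
    rdd = sc.parallelize([
        ("k1", [{"id": 1}]),
        ("k1", [{"id": 2}]),
        ("k2", [{"id": 3}]),
    ])

    # Each a + b copies both input lists, so merging n single-element
    # lists under one key does O(n^2) work in total; with large values
    # this surfaces as very slow tasks.
    merged = rdd.reduceByKey(lambda a, b: a + b)
    print(merged.collect())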