It seems you want to dedupe your data after the merge, so set(a + b) should
also work; you can ditch the list comprehension.
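
For illustration, a minimal sketch of that suggestion (the sample lists are
made up, and it assumes the list elements are hashable, e.g. tuples; plain
dicts are not hashable and would need freezing first):

    a = [("x", 1), ("y", 2)]
    b = [("y", 2), ("z", 3)]

    merged = list(set(a + b))  # concatenate, then dedupe in one pass
    print(sorted(merged))      # [('x', 1), ('y', 2), ('z', 3)]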
On 5 Aug 2015 23:55, "gen tang" wrote:
Hi,
Thanks a lot for your reply.
It seems that the problem comes from the slowness of the second piece of code.
I rewrote the code as list(set([i.items for i in a] + [i.items for i in b]))
and the program returns to normal.
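
As a hedged sketch of that dedupe-by-items idea (the sample data is made up;
note that dict.items() returns an unhashable list in Python 2 and a view in
Python 3, so each dict is frozen into a sorted tuple of pairs first):

    a = [{"id": 1}, {"id": 2}]
    b = [{"id": 2}, {"id": 3}]

    def freeze(d):
        # Turn a dict into a hashable, order-independent key.
        return tuple(sorted(d.items()))

    # Dedupe on the frozen form, then thaw back to dicts.
    unique = [dict(t) for t in {freeze(d) for d in a + b}]
    print(sorted(unique, key=lambda d: d["id"]))  # [{'id': 1}, {'id': 2}, {'id': 3}]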
By the way, I find that while the computation is running, the UI shows
scheduler delay. However, I
On Mon, Aug 3, 2015 at 9:00 AM, gen tang wrote:
Hi,
Recently, I met some problems with scheduler delay in pyspark. I worked on
this problem for several days without success, so I am coming here to ask
for help.
I have a key-value pair rdd like rdd[(key, list[dict])] and I tried to
merge the values by "adding" the two lists.
If I do reduceByKey as fo
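
A minimal sketch of the setup described above (the keys, dict contents, app
name, and "local[2]" master are assumptions for illustration):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "merge-list-values")

    rdd = sc.parallelize([
        ("k1", [{"a": 1}]),
        ("k1", [{"a": 2}]),
        ("k2", [{"b": 3}]),
    ])

    # "Adding" two lists concatenates them, so each key ends up with one
    # combined list of dicts.
    merged = rdd.reduceByKey(lambda xs, ys: xs + ys)
    print(merged.collect())
    # e.g. [('k1', [{'a': 1}, {'a': 2}]), ('k2', [{'b': 3}])] (order not guaranteed)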