Re: Distinct is very slow

2015-04-23 Thread Jeetendra Gangele
Does anyone have any thoughts on this? On 22 April 2015 at 22:49, Jeetendra Gangele gangele...@gmail.com wrote: I set 7000 tasks in mapToPair, and I set the same number of tasks in distinct. There is still a lot of shuffle read and write happening, and the application is running for much longer because of it. Any ideas?

Re: Distinct is very slow

2015-04-17 Thread Akhil Das
How many tasks are you seeing in your mapToPair stage? Is it 7000? If so, I suggest passing a number close to 7000 to your .distinct call. What is happening in your case is that you are repartitioning your data to a smaller number of partitions (32), which would put a lot of load on processing i
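The effect described above can be illustrated with a minimal plain-Java sketch; it does not use Spark. It assumes keys are spread across partitions by hash code modulo the partition count, which is the scheme Spark's HashPartitioner uses. The class name PartitionLoadSketch, its methods, and the sequential Long keys are hypothetical and exist only for this illustration. The figure of 400,000 records is borrowed from the thread as a sample size, not as a measurement of the actual distinct stage.

```java
import java.util.Arrays;

// Plain-Java sketch (no Spark dependency): estimates how many records each
// reduce-side partition receives when keys are spread by hash partitioning.
final class PartitionLoadSketch {

    // Same idea as hash partitioning in Spark: key.hashCode() modulo the
    // partition count, adjusted so the result is never negative.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Counts records per partition for the hypothetical keys 0..numRecords-1.
    static long[] recordsPerPartition(long numRecords, int numPartitions) {
        long[] counts = new long[numPartitions];
        for (long key = 0; key < numRecords; key++) {
            counts[partitionFor(Long.valueOf(key), numPartitions)]++;
        }
        return counts;
    }

    // Size of the most heavily loaded partition.
    static long maxLoad(long[] counts) {
        return Arrays.stream(counts).max().orElse(0L);
    }

    public static void main(String[] args) {
        long numRecords = 400_000L; // sample size taken from the thread
        for (int partitions : new int[] {32, 7000}) {
            long[] counts = recordsPerPartition(numRecords, partitions);
            System.out.println(partitions + " partitions -> largest partition holds "
                    + maxLoad(counts) + " records");
        }
    }
}
```

With evenly distributed keys, each of 32 partitions ends up holding far more records than each of 7000 partitions, so every task in the 32-partition case has more data to fetch and de-duplicate. Real data with skewed keys may behave differently.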

Re: Distinct is very slow

2015-04-17 Thread Jeetendra Gangele
I mean partitioning with something like partitionBy(new HashPartitioner(16)). Will this not work? On 17 April 2015 at 21:28, Jeetendra Gangele gangele...@gmail.com wrote: I have given 3000 tasks to mapToPair; now it is taking so much memory and shuffling, and wasting time there. Here are the stats

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Can you paste your complete code? Did you try repartitioning/increasing the level of parallelism to speed up the processing, since you have 16 cores and I'm assuming your 400k records aren't bigger than a 10G dataset? Thanks Best Regards On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele

Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Hi All, I have the code below, where distinct is running for a long time. blockingRdd is a combination of Long, String and it will have 400K records. JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(new Function<String, Iterable<Integer>>() { @Override public Iterable<Integer>

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Open the driver UI and see which stage is taking time; you can check whether it's adding any GC time, etc. Thanks Best Regards On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, I have the code below, where distinct is running for a long time. blockingRdd is the

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
I already checked, and GC is taking 1 sec for each task. Is this too much? If yes, how can I avoid this? On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote: Open the driver UI and see which stage is taking time; you can check whether it's adding any GC time, etc. Thanks Best

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
No, I did not try the partitioning. Below is the full code: public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException { long start = System.currentTimeMillis(); JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(new

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Akhil, any thoughts on this? On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote: No, I did not try the partitioning. Below is the full code: public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException { long start =

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
At the distinct level I will have 7000 times more elements in my RDD. So should I repartition? Its parent will definitely have fewer partitions. How can I see the number of partitions from Java code? On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote: No, I did not try the