Anyone have any thoughts on this?
On 22 April 2015 at 22:49, Jeetendra Gangele gangele...@gmail.com wrote:
I created 7000 tasks in mapToPair, and the same number of tasks in distinct.
Still, a lot of shuffle read and write is happening, which makes the
application run much longer.
Any idea?
How many tasks are you seeing in your mapToPair stage? Is it 7000? Then I
suggest giving a number similar/close to 7000 in your .distinct call.
What is happening in your case is that you are repartitioning your data to
a smaller number (32), which puts a lot of load on processing
I am suggesting to partition with something like partitionBy(new
HashPartitioner(16)). Will this not work?
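For context, a hedged sketch of what that call would look like, assuming the `blockingRdd` of (Long, String) pairs described later in the thread (note that 16 partitions may be far too few for a stage whose parent runs ~7000 tasks, which is the point the previous reply was making):

```java
import org.apache.spark.HashPartitioner;

// Sketch only: assumes an existing JavaPairRDD<Long, String> blockingRdd.
// partitionBy shuffles the data into a fixed number of hash partitions;
// a count this small can bottleneck a wide stage.
JavaPairRDD<Long, String> partitioned =
    blockingRdd.partitionBy(new HashPartitioner(16));
```

The partitioner count usually wants to be a small multiple of the total cores, or close to the parent stage's task count for heavy shuffles.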
On 17 April 2015 at 21:28, Jeetendra Gangele gangele...@gmail.com wrote:
I have given 3000 tasks to mapToPair, and now it is taking a lot of memory,
shuffling heavily, and wasting time there. Here are the stats:
Can you paste your complete code? Did you try repartitioning/increasing the
level of parallelism to speed up the processing? Since you have 16 cores,
I'm assuming your 400k records aren't bigger than a 10G dataset.
Thanks
Best Regards
On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele
Hi All, I have the code below where distinct is running for a long time.
blockingRdd is a pair RDD of (Long, String) and it will have 400K
records.
JavaPairRDD<Long, Integer> completeDataToprocess = blockingRdd.flatMapValues(
new Function<String, Iterable<Integer>>() {
@Override
public Iterable<Integer>
Open the driver UI and see which stage is taking time; you can look at
whether it's adding any GC time etc.
Thanks
Best Regards
On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi All, I have the code below where distinct is running for a long time.
blockingRdd is the
I already checked, and GC is taking 1 sec for each task. Is this too much?
If yes, how do I avoid it?
On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote:
Open the driver UI and see which stage is taking time; you can look at
whether it's adding any GC time etc.
Thanks
Best
No, I did not try partitioning. Below is the full code:
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
JavaSparkContext jsc) throws IOException {
long start = System.currentTimeMillis();
JavaPairRDD<Long, MatcherReleventData> RddForMarch
= matchRdd.zipWithIndex().mapToPair(new
Akhil, any thought on this?
On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote:
No, I did not try partitioning. Below is the full code:
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
JavaSparkContext jsc) throws IOException {
long start =
At the distinct stage I will have 7000 times more elements in my RDD, so
should I repartition? Its parent will definitely have fewer partitions.
How can I see the number of partitions through Java code?
On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote:
No, I did not try the