subject:"Replacing groupBykey() with reduceByKey()"

Re: Replacing groupBykey() with reduceByKey()

2018-08-08 Thread Biplob Biswas

Hi Santhosh, My name is not Bipin, it's Biplob, as is clear from my signature. Regarding your question, I have no clue what your map operation is doing on the grouped data, so I can only suggest that you do: dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:

Re: Replacing groupBykey() with reduceByKey()

2018-08-06 Thread Bathi CCDB

Hey Bipin, Thanks for the reply. I am actually aggregating after the groupByKey() operation; I posted the wrong code snippet in my first email. Here is what I am doing: dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).groupByKey(25).map(build_edges) Can we replace

Re: Replacing groupBykey() with reduceByKey()

2018-08-06 Thread Biplob Biswas

Hi Santhosh, If you are not performing any aggregation, then I don't think you can replace your groupByKey() with a reduceByKey(). As far as I can see, you are only grouping and taking 2 values of the result, so I believe you can't simply swap one call for the other. Thanks & Regards Biplob Biswas
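
To illustrate the point in the reply above, here is a minimal sketch in plain Python, not Spark. The helpers `group_by_key` and `reduce_by_key` are hypothetical local stand-ins that only mimic the semantics of the RDD methods, and the rows are made up. It shows that reduceByKey() fits naturally when two values can be combined into one (a sum here), while reproducing a pure grouping means wrapping every row in a list and concatenating, which rebuilds the same groups and, in Spark, would not be expected to save any shuffle traffic.

```python
from collections import defaultdict
from functools import reduce


# Hypothetical local stand-ins for the RDD methods; plain Python, not Spark.
def group_by_key(pairs):
    """Collect every value for a key into a list, like groupByKey()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_by_key(func, pairs):
    """Fold the values of each key pairwise with func, like reduceByKey(func)."""
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}


# Made-up rows; x[0] is the key, mirroring map(lambda x: (x[0], x)) in the thread.
rows = [("a", 1), ("b", 2), ("a", 3)]
pairs = [(x[0], x) for x in rows]

# Pure grouping: every row is kept, nothing is combined.
grouped = group_by_key(pairs)

# Aggregation: a two-argument function that merges two values is enough.
sums = reduce_by_key(lambda left, right: left + right, [(k, x[1]) for k, x in pairs])

# Grouping expressed as a reduction: wrap each row in a list, then concatenate.
regrouped = reduce_by_key(lambda left, right: left + right, [(k, [x]) for k, x in pairs])
```

In this sketch `regrouped` holds the same lists as `grouped`, so the reduction only pays off when the combining function actually shrinks the data, as the sum does.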

Replacing groupBykey() with reduceByKey()

2018-08-03 Thread Bathi CCDB

I am trying to replace groupByKey() with reduceByKey(). I am a PySpark and Python newbie, and I am having a hard time figuring out the lambda function for the reduceByKey() operation. Here is the code: dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).groupByKey(25).take(2) Here