Hi Santhosh,
My name is not Bipin, it's Biplob, as is clear from my signature.
Regarding your question, I have no clue what your map operation is doing on
the grouped data, so I can only suggest something like:
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
Hey Bipin,
Thanks for the reply. I am actually aggregating after the groupByKey()
operation; I posted the wrong code snippet in my first email. Here is what I
am doing:
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).groupByKey(25).map(build_edges)
Can we replace
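Since build_edges isn't shown, here is only a hedged, pure-Python sketch of the general pattern (no Spark cluster needed to run it): *if* the per-key work can be expressed as an associative, commutative combine of partial results, a groupByKey-then-map can become a reduceByKey-style fold. The combine function and sample rows below are invented for illustration; whether this applies depends entirely on what build_edges actually does.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the aggregation: assumes each group of rows
# can be reduced to a pairwise-combinable summary (here, a row count).
def combine(a, b):
    return a + b

# Invented sample data shaped like the (x[0], x) pairs in the snippet above.
rows = [("k1", ("k1", 10)), ("k1", ("k1", 20)), ("k2", ("k2", 5))]

# groupByKey-style: materialize the whole group, then aggregate it.
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0))}
via_group = {k: len(vs) for k, vs in grouped.items()}

# reduceByKey-style: map each value to a partial result (1 per row),
# then fold pairwise with the associative, commutative combine function,
# so full groups never need to be held in memory at once.
via_reduce = {}
for k, _ in rows:
    via_reduce[k] = combine(via_reduce.get(k, 0), 1)

print(via_group)   # {'k1': 2, 'k2': 1}
print(via_reduce)  # {'k1': 2, 'k2': 1}
```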
Hi Santhosh,
If you are not performing any aggregation, then I don't think you can
replace your groupByKey with a reduceByKey. As far as I can see, you are
only grouping and taking 2 values of the result, so I don't believe you can
simply swap in reduceByKey there.
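To make that concrete, a small pure-Python sketch (keys and values invented) of why plain grouping doesn't fit reduceByKey's contract: the function you pass must take two values and return one value *of the same type*, so "grouping" with it forces you to wrap every value in a list and concatenate, which just rebuilds groupByKey with extra copying.

```python
pairs = [("a", 1), ("a", 2), ("b", 3)]

# reduceByKey's contract is f(V, V) -> V. To collect values with it,
# each value must first be wrapped in a single-element list so that
# list concatenation becomes the combine step.
wrapped = [(k, [v]) for k, v in pairs]

out = {}
for k, v in wrapped:
    # Pairwise list concatenation: this is groupByKey re-implemented,
    # with no reduction in data shuffled or held per key.
    out[k] = out[k] + v if k in out else v

print(out)  # {'a': [1, 2], 'b': [3]}
```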
Thanks & Regards
Biplob Biswas
I am trying to replace groupByKey() with reduceByKey(). I am a PySpark and
Python newbie, and I am having a hard time figuring out the lambda function
for the reduceByKey() operation.
Here is the code
dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x:
(x[0],x)).groupByKey(25).take(2)
Here
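On the lambda itself: the function given to reduceByKey takes two *values* (never the key) and returns one combined value, and it should be associative and commutative since Spark applies it in any order across partitions. A minimal pure-Python sketch of that binary shape, using an invented numeric field, since the right lambda depends on what is actually being aggregated:

```python
from functools import reduce

# Invented (key, value) records; only the values are combined.
values = [("k", 10), ("k", 20), ("k", 5)]

# The same two-argument lambda you would hand to reduceByKey,
# demonstrated here with functools.reduce over one key's values.
total = reduce(lambda a, b: a + b, (v for _, v in values))

print(total)  # 35
```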