I am trying to replace groupByKey() with reduceByKey(). I am a PySpark and
Python newbie, and I am having a hard time figuring out the lambda function
for the reduceByKey() operation.

Here is the code:

dd = hive_context.read.orc(orcfile_dir).rdd \
    .map(lambda x: (x[0], x)) \
    .groupByKey(25) \
    .take(2)

Here is the return value:

>>> dd
[(u'KEY_1', <pyspark.resultiterable.ResultIterable object at 0x107be0c50>),
 (u'KEY_2', <pyspark.resultiterable.ResultIterable object at 0x107be0c10>)]

And here are the contents of the iterable dd[0][1]:

Row(key=u'KEY_1', hash_fn=u'deec95d65ca6b3b4f2e1ef259040aa79', value=u'e7dc1f2a')
Row(key=u'KEY_1', hash_fn=u'f8891048a9ef8331227b4af080ecd28a', value=u'fb0bc953')
.......
Row(key=u'KEY_1', hash_fn=u'1b9d2bb2db28603ff21052efcd13f242', value=u'd39714d3')
Row(key=u'KEY_1', hash_fn=u'c41b0269706ac423732a6bab24bf8a6a', value=u'ab58db92')

My question is: how do I replace groupByKey() with reduceByKey() and get the
same output as above?
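
From reading the docs, my best guess is to wrap each Row in a one-element
list in the map() step and then concatenate the lists in reduceByKey(). This
is just a sketch of what I have in mind, and I am not sure it is correct, or
whether it is actually any better than groupByKey() for this:

# Sketch: wrap each Row in a single-element list, then concatenate the
# per-key lists with reduceByKey(); 25 is the number of partitions.
dd = hive_context.read.orc(orcfile_dir).rdd \
    .map(lambda x: (x[0], [x])) \
    .reduceByKey(lambda a, b: a + b, 25) \
    .take(2)
# The values would then be plain Python lists of Rows rather than
# ResultIterable objects.

Is this the right way to do it?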

Santhosh
