Are the elements count big per group? If not, you can group them and use the 
code to calculate the median and diff.


Yong

________________________________
From: Craig Ching <craigch...@gmail.com>
Sent: Wednesday, March 22, 2017 3:17 PM
To: user@spark.apache.org
Subject: calculate diff of value and median in a group

Hi,

When using pyspark, I'd like to be able to calculate the difference between 
grouped values and their median for the group.  Is this possible?  Here is some 
code I hacked up that does what I want except that it calculates the grouped 
diff from mean.  Also, please feel free to comment on how I could make this 
better if you feel like being helpful :)

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StringType,
    LongType,
    DoubleType,
    StructField,
    StructType
)
from pyspark.sql import functions as F


sc = SparkContext(appName='myapp')
spark = SparkSession(sc)

file_name = 'data.csv'

fields = [
    StructField(
        'group2',
        LongType(),
        True),
    StructField(
        'name',
        StringType(),
        True),
    StructField(
        'value',
        DoubleType(),
        True),
    StructField(
        'group1',
        LongType(),
        True)
]
schema = StructType(fields)

df = spark.read.csv(
    file_name, header=False, mode="DROPMALFORMED", schema=schema
)
df.show()
means = df.select([
    'group1',
    'group2',
    'name',
    'value']).groupBy([
        'group1',
        'group2'
    ]).agg(
        F.mean('value').alias('mean_value')
    ).orderBy('group1', 'group2')

cond = [df.group1 == means.group1, df.group2 == means.group2]

means.show()
df = df.select([
    'group1',
    'group2',
    'name',
    'value']).join(
        means,
        cond
    ).drop(
        df.group1
    ).drop(
        df.group2
    ).select('group1',
             'group2',
             'name',
             'value',
             'mean_value')

final = df.withColumn(
    'diff',
    F.abs(df.value - df.mean_value))
final.show()

sc.stop()

And here is an example dataset I'm playing with:

100,name1,0.43,0
100,name2,0.33,0
100,name3,0.73,0
101,name1,0.29,0
101,name2,0.96,0
101,name3,0.42,0
102,name1,0.01,0
102,name2,0.42,0
102,name3,0.51,0
103,name1,0.55,0
103,name2,0.45,0
103,name3,0.02,0
104,name1,0.93,0
104,name2,0.16,0
104,name3,0.74,0
105,name1,0.41,0
105,name2,0.65,0
105,name3,0.29,0
100,name1,0.51,1
100,name2,0.51,1
100,name3,0.43,1
101,name1,0.59,1
101,name2,0.55,1
101,name3,0.84,1
102,name1,0.01,1
102,name2,0.98,1
102,name3,0.44,1
103,name1,0.47,1
103,name2,0.16,1
103,name3,0.02,1
104,name1,0.83,1
104,name2,0.89,1
104,name3,0.31,1
105,name1,0.59,1
105,name2,0.77,1
105,name3,0.45,1

and here is what I'm trying to produce:

group1,group2,name,value,median,diff
0,100,name1,0.43,0.43,0.0
0,100,name2,0.33,0.43,0.10
0,100,name3,0.73,0.43,0.30
0,101,name1,0.29,0.42,0.13
0,101,name2,0.96,0.42,0.54
0,101,name3,0.42,0.42,0.0
0,102,name1,0.01,0.42,0.41
0,102,name2,0.42,0.42,0.0
0,102,name3,0.51,0.42,0.09
0,103,name1,0.55,0.45,0.10
0,103,name2,0.45,0.45,0.0
0,103,name3,0.02,0.45,0.43
0,104,name1,0.93,0.74,0.19
0,104,name2,0.16,0.74,0.58
0,104,name3,0.74,0.74,0.0
0,105,name1,0.41,0.41,0.0
0,105,name2,0.65,0.41,0.24
0,105,name3,0.29,0.41,0.24
1,100,name1,0.51,0.51,0.0
1,100,name2,0.51,0.51,0.0
1,100,name3,0.43,0.51,0.08
1,101,name1,0.59,0.59,0.0
1,101,name2,0.55,0.59,0.04
1,101,name3,0.84,0.59,0.25
1,102,name1,0.01,0.44,0.43
1,102,name2,0.98,0.44,0.54
1,102,name3,0.44,0.44,0.0
1,103,name1,0.47,0.16,0.31
1,103,name2,0.16,0.16,0.0
1,103,name3,0.02,0.16,0.14
1,104,name1,0.83,0.83,0.0
1,104,name2,0.89,0.83,0.06
1,104,name3,0.31,0.83,0.52
1,105,name1,0.59,0.59,0.0
1,105,name2,0.77,0.59,0.18
1,105,name3,0.45,0.59,0.14

Thanks for any help!

Cheers,
Craig

Reply via email to