Re: looking for an easy to to find the max value of a column in a data frame

Andy Davidson Tue, 29 Mar 2016 10:58:39 -0700

Nice

From:  Alexander Krasnukhin <the.malk...@gmail.com>
Date:  Tuesday, March 29, 2016 at 10:42 AM
To:  Andrew Davidson <a...@santacruzintegration.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: looking for an easy to to find the max value of a column in a
data frame


> You can even use the fact that pyspark has dynamic properties
> 
> rows = idDF2.select(max("col[id]").alias("max")).collect()
> firstRow = rows[0]
> max = firstRow.max
> 
> On Tue, Mar 29, 2016 at 7:14 PM, Alexander Krasnukhin <the.malk...@gmail.com>
> wrote:
>> You should be able to index columns directly either by index or column name
>> i.e.
>> 
>> from pyspark.sql.functions import max
>> 
>> rows = idDF2.select(max("col[id]")).collect()
>> firstRow = rows[0]
>> 
>> # by index
>> max = firstRow[0]
>> 
>> # by column name
>> max = firstRow["max(col[id])"]
>> 
>> On Tue, Mar 29, 2016 at 6:58 PM, Andy Davidson
>> <a...@santacruzintegration.com> wrote:
>>> Hi Alexander
>>> 
>>> Many thanks. I think the key was I needed to import that max function. Turns
>>> out you do not need to use col
>>> df.select(max("foo")).show()
>>> 
>>> To get the actual value of max you still need to write more code than I
>>> would expect. I wonder if there is an easier way to work with Rows?
>>> 
>>> In [19]:
>>> from pyspark.sql.functions import max
>>> maxRow = idDF2.select(max("col[id]")).collect()
>>> max = maxRow[0].asDict()['max(col[id])']
>>> max
>>> Out[19]:
>>> 713912692155621376
>>> 
>>> From:  Alexander Krasnukhin <the.malk...@gmail.com>
>>> Date:  Monday, March 28, 2016 at 5:55 PM
>>> To:  Andrew Davidson <a...@santacruzintegration.com>
>>> Cc:  "user @spark" <user@spark.apache.org>
>>> Subject:  Re: looking for an easy to to find the max value of a column in a
>>> data frame
>>> 
>>>> e.g. select max value for column "foo":
>>>> 
>>>> from pyspark.sql.functions import max, col
>>>> df.select(max(col("foo"))).show()
>>>> 
>>>> On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson
>>>> <a...@santacruzintegration.com> wrote:
>>>>> I am using pyspark 1.6.1 and python3.
>>>>> 
>>>>> 
>>>>> Given:
>>>>> 
>>>>> idDF2 = idDF.select(idDF.id, idDF.col.id)
>>>>> idDF2.printSchema()
>>>>> idDF2.show()
>>>>> root
>>>>>  |-- id: string (nullable = true)
>>>>>  |-- col[id]: long (nullable = true)
>>>>> 
>>>>> +----------+----------+
>>>>> |        id|   col[id]|
>>>>> +----------+----------+
>>>>> |1008930924| 534494917|
>>>>> |1008930924| 442237496|
>>>>> |1008930924|  98069752|
>>>>> |1008930924|2790311425|
>>>>> |1008930924|3300869821|
>>>>> 
>>>>> 
>>>>> I have to do a lot of work to get the max value
>>>>> 
>>>>> rows = idDF2.select("col[id]").describe().collect()
>>>>> hack = [s for s in rows if s.summary == 'max']
>>>>> print(hack)
>>>>> print(hack[0].summary)
>>>>> print(type(hack[0]))
>>>>> print(hack[0].asDict()['col[id]'])
>>>>> maxStr = hack[0].asDict()['col[id]']
>>>>> ttt = int(maxStr)
>>>>> numDimensions = 1 + ttt
>>>>> print(numDimensions)
>>>>> 
>>>>> Is there an easier way?
>>>>> 
>>>>> Kind regards
>>>>> 
>>>>> Andy
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Regards,
>>>> Alexander
>> 
>> 
>> 
>> -- 
>> Regards,
>> Alexander
> 
> 
> 
> -- 
> Regards,
> Alexander
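
The Row handling discussed above can be wrapped in a small helper. This is only a sketch, not something posted in the thread: the name first_scalar is hypothetical, and the PySpark usage is left in comments because it needs a running Spark session. Since Row subclasses tuple, the helper is exercised here with a plain tuple standing in for the result of collect().

```python
def first_scalar(rows, default=None):
    """Return field 0 of the first row, or `default` when `rows` is empty.

    `rows` is meant to be the list returned by DataFrame.collect().
    Row subclasses tuple, so a list of plain tuples works as well.
    """
    if not rows:
        return default
    return rows[0][0]


# Hypothetical usage with the thread's idDF2 (requires a Spark session):
#   from pyspark.sql import functions as F   # avoids shadowing the builtin max
#   max_id = first_scalar(idDF2.select(F.max("col[id]")).collect())

# Stand-in for a collect() result, using one value from the table shown above.
rows = [(3300869821,)]
max_id = first_scalar(rows)
num_dimensions = 1 + max_id
print(num_dimensions)
```

Importing the functions module under an alias, as in the commented usage, keeps Python's builtin max available, which the examples in the thread overwrite by assigning the result to a variable named max.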
