Nice

From: Alexander Krasnukhin <the.malk...@gmail.com>
Date: Tuesday, March 29, 2016 at 10:42 AM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: looking for an easy to to find the max value of a column in a data frame

> You can even use the fact that pyspark has dynamic properties
>
> rows = idDF2.select(max("col[id]").alias("max")).collect()
> firstRow = rows[0]
> max = firstRow.max
>
> On Tue, Mar 29, 2016 at 7:14 PM, Alexander Krasnukhin <the.malk...@gmail.com>
> wrote:
>> You should be able to index columns directly either by index or column name
>> i.e.
>>
>> from pyspark.sql.functions import max
>>
>> rows = idDF2.select(max("col[id]")).collect()
>> firstRow = rows[0]
>>
>> # by index
>> max = firstRow[0]
>>
>> # by column name
>> max = firstRow["max(col[id])"]
>>
>> On Tue, Mar 29, 2016 at 6:58 PM, Andy Davidson
>> <a...@santacruzintegration.com> wrote:
>>> Hi Alexander
>>>
>>> Many thanks. I think the key was I needed to import that max function. Turns
>>> out you do not need to use col
>>> df.select(max("foo")).show()
>>>
>>> To get the actual value of max you still need to write more code than I
>>> would expect. I wonder if there is an easier way to work with Rows?
>>>
>>> In [19]:
>>> from pyspark.sql.functions import max
>>> maxRow = idDF2.select(max("col[id]")).collect()
>>> max = maxRow[0].asDict()['max(col[id])']
>>> max
>>> Out[19]:
>>> 713912692155621376
>>>
>>> From: Alexander Krasnukhin <the.malk...@gmail.com>
>>> Date: Monday, March 28, 2016 at 5:55 PM
>>> To: Andrew Davidson <a...@santacruzintegration.com>
>>> Cc: "user @spark" <user@spark.apache.org>
>>> Subject: Re: looking for an easy to to find the max value of a column in a
>>> data frame
>>>
>>>> e.g. select max value for column "foo":
>>>>
>>>> from pyspark.sql.functions import max, col
>>>> df.select(max(col("foo"))).show()
>>>>
>>>> On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson
>>>> <a...@santacruzintegration.com> wrote:
>>>>> I am using pyspark 1.6.1 and python3.
>>>>>
>>>>>
>>>>> Given:
>>>>>
>>>>> idDF2 = idDF.select(idDF.id, idDF.col.id)
>>>>> idDF2.printSchema()
>>>>> idDF2.show()
>>>>> root
>>>>>  |-- id: string (nullable = true)
>>>>>  |-- col[id]: long (nullable = true)
>>>>>
>>>>> +----------+----------+
>>>>> |        id|   col[id]|
>>>>> +----------+----------+
>>>>> |1008930924| 534494917|
>>>>> |1008930924| 442237496|
>>>>> |1008930924|  98069752|
>>>>> |1008930924|2790311425|
>>>>> |1008930924|3300869821|
>>>>>
>>>>>
>>>>> I have to do a lot of work to get the max value
>>>>>
>>>>> rows = idDF2.select("col[id]").describe().collect()
>>>>> hack = [s for s in rows if s.summary == 'max']
>>>>> print(hack)
>>>>> print(hack[0].summary)
>>>>> print(type(hack[0]))
>>>>> print(hack[0].asDict()['col[id]'])
>>>>> maxStr = hack[0].asDict()['col[id]']
>>>>> ttt = int(maxStr)
>>>>> numDimensions = 1 + ttt
>>>>> print(numDimensions)
>>>>>
>>>>> Is there an easier way?
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Andy
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Alexander
>>
>>
>>
>> --
>> Regards,
>> Alexander
>
>
>
> --
> Regards,
> Alexander