OK, I see there is a describe() function which computes statistics on a Dataset, similar to StatCounter. However, I don't want to restrict my aggregations to the standard mean, stddev, etc. I want to generate some custom stats, or run only a subset of the predefined stats on a particular column. I was thinking that if we write some custom code which does this in one action (job), that would work for me.
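A rough sketch of what I have in mind, in plain Java so it can be tried without a cluster. It builds one aliased SQL expression per (column, operation) pair and later zips the result field names with the values. The names AggExprs, build and toMap are made up for illustration, and the Spark calls in the comments (selectExpr, first, schema().fieldNames()) are my assumption of how it would be wired up; they are not exercised here. Column names containing special characters would need quoting, and a truly custom statistic would need a registered UDAF in place of a built-in function name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AggExprs {
    // Build one expression per (column, operation) pair, e.g. "min(age) AS age_min".
    // The alias makes each result column identifiable regardless of evaluation order.
    static List<String> build(List<String> columns, List<String> ops) {
        List<String> exprs = new ArrayList<>();
        for (String col : columns) {
            for (String op : ops) {
                exprs.add(op + "(" + col + ") AS " + col + "_" + op);
            }
        }
        return exprs;
    }

    // Pair each result field name with its value, preserving column order.
    static Map<String, String> toMap(String[] fieldNames, Object[] values) {
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            out.put(fieldNames[i], String.valueOf(values[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> exprs = build(Arrays.asList("age", "salary"),
                                   Arrays.asList("min", "max"));
        System.out.println(exprs);

        // Assumed Spark usage (needs Spark on the classpath, not run here):
        //   Row row = df.selectExpr(exprs.toArray(new String[0])).first();
        //   String[] names = row.schema().fieldNames();
        //   Object[] values = new Object[names.length];
        //   for (int i = 0; i < names.length; i++) values[i] = row.get(i);
        //   Map<String, String> stats = toMap(names, values);
    }
}
```

Since all expressions go into a single select, the aggregates should be computed in one job, and the aliases answer the question of which value belongs to which operation.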

On Tue, Aug 29, 2017 at 12:02 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:

> RDD only.
>
> Patrick <titlibat...@gmail.com> wrote on Mon, Aug 28, 2017 at 20:13:
>
>> Ah, does it work with the Dataset API, or do I need to convert it to an RDD first?
>>
>> On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
>>
>>> What about the RDD StatCounter?
>>> https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html
>>>
>>> Patrick <titlibat...@gmail.com> wrote on Mon, Aug 28, 2017 at 16:47:
>>>
>>>> Hi,
>>>>
>>>> I have two lists:
>>>>
>>>> - List one: contains the names of the columns on which I want to do
>>>>   aggregate operations.
>>>> - List two: contains the aggregate operations that I want to
>>>>   perform on each column, e.g. (min, max, mean).
>>>>
>>>> I am trying to use the Spark 2.0 Dataset API to achieve this. Spark provides an
>>>> agg() where you can pass a Map<String,String> (of column name and
>>>> respective aggregate operation) as input. However, I want to perform
>>>> different aggregation operations on the same column of the data and
>>>> collect the result in a Map<String,String> where the key is the aggregate
>>>> operation and the value is the result on the particular column. If I add
>>>> different agg() operations for the same column, the key gets overwritten
>>>> with the latest value.
>>>>
>>>> Also, I don't find any collectAsMap() operation that returns a map with the
>>>> aggregated column name as key and the result as value. I get collectAsList(),
>>>> but I don't know the order in which those agg() operations are run, so how do
>>>> I match which list value corresponds to which agg operation? I am able to
>>>> see the result using .show(), but how can I collect the result in this case?
>>>>
>>>> Is it possible to do different aggregations on the same column in one
>>>> job (i.e. only one collect operation) using the agg() operation?
>>>>
>>>> Thanks in advance.