If the data is already in an RDD, the easiest way to calculate min/max for
each column would be the aggregate() function. It takes a zero value and two
functions as arguments: the first folds RDD elements into your "accumulator",
the second merges two accumulators. This way both min and max for all the
columns in your RDD are calculated in a single pass over it.
Here's an example in Python:

import random

def agg1(acc, row):
    # Fold one row into the accumulator [mins, maxs]
    if len(acc) == 0: return [list(row), list(row)]
    return [map(min, zip(acc[0], row)), map(max, zip(acc[1], row))]

def agg2(a, b):
    # Merge two accumulators; either one may still be the empty zero value
    if len(a) == 0: return b
    if len(b) == 0: return a
    return [map(min, zip(a[0], b[0])), map(max, zip(a[1], b[1]))]

rdd  = sc.parallelize(xrange(100000), 5)
rdd2 = rdd.map(lambda x: [random.randint(1, 100) for _ in xrange(15)])
rdd2.aggregate([], agg1, agg2)  # -> [[min per column], [max per column]]

What I would personally do in your case depends on what else you want to do
with the data. If you plan to run some more business logic on top of it and
you're more comfortable with SQL, it might be worth registering this DataFrame
as a table and generating a SQL query against it (generate a string with a
series of min/max calls). But to solve your specific problem I'd load the file
with textFile(), use a map() transformation to split each line by comma and
convert it to an array of doubles, and call aggregate() on it just like in the
example above.
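
For completeness, here is roughly how both routes could look in PySpark
(only a sketch; "mytable" and "data.csv" are illustrative names, not from
your message, and agg1/agg2 are the functions defined above):

# SQL route: register the DataFrame and generate the min/max query string
df.registerTempTable("mytable")
exprs = ", ".join("MIN({0}) AS min_{0}, MAX({0}) AS max_{0}".format(c)
                  for c in df.columns)
sqlContext.sql("SELECT " + exprs + " FROM mytable").show()

# textFile() route: split each line by comma, convert to doubles, aggregate
lines   = sc.textFile("data.csv")
doubles = lines.map(lambda line: [float(v) for v in line.split(",")])
doubles.aggregate([], agg1, agg2)  # [[min per column], [max per column]]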

On Fri, Aug 28, 2015 at 6:15 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Or you can just call describe() on the DataFrame? In addition to min/max,
> you'll also get the mean and the count of non-null and non-NA elements.
>
> Burak
>
> On Fri, Aug 28, 2015 at 10:09 AM, java8964 <java8...@hotmail.com> wrote:
>
>> Or RDD.max() and RDD.min() won't work for you?
>>
>> Yong
>>
>> ------------------------------
>> Subject: Re: Calculating Min and Max Values using Spark Transformations?
>> To: as...@wso2.com
>> CC: user@spark.apache.org
>> From: jfc...@us.ibm.com
>> Date: Fri, 28 Aug 2015 09:28:43 -0700
>>
>>
>> If you already loaded the CSV data into a DataFrame, why not register it as
>> a table and use Spark SQL to find the max/min or any other aggregates?
>> SELECT MAX(column_name) FROM dftable_name ... seems natural.
>>
>>
>>
>>
>>
>>    *JESSE CHEN*
>>    Big Data Performance | IBM Analytics
>>
>>    Office:  408 463 2296
>>    Mobile: 408 828 9068
>>    Email:   jfc...@us.ibm.com
>>
>>
>>
>>
>> From: ashensw <as...@wso2.com>
>> To: user@spark.apache.org
>> Date: 08/28/2015 05:40 AM
>> Subject: Calculating Min and Max Values using Spark Transformations?
>>
>> ------------------------------
>>
>>
>>
>> Hi all,
>>
>> I have a dataset which consists of a large number of features (columns). It
>> is in CSV format, so I loaded it into a Spark DataFrame. Then I converted it
>> into a JavaRDD<Row>, then using a Spark transformation converted that into a
>> JavaRDD<String[]>, and then converted that again into a JavaRDD<double[]>.
>> So now I have a JavaRDD<double[]>. Is there any method to calculate the max
>> and min values of each column in this JavaRDD<double[]>?
>>
>> Or is there any way to access the array if I store the max and min values
>> in an array inside the Spark transformation class?
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Calculating-Min-and-Max-Values-using-Spark-Transformations-tp24491.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>


-- 
Best regards, Alexey Grishchenko

phone: +353 (87) 262-2154
email: programme...@gmail.com
web:   http://0x0fff.com
