Re: reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan


> On Jun 26, 2015, at 12:46 AM, Sven Krasser  wrote:
> 
> In that case the reduceByKey operation will likely not give you any benefit 
> (since you are not aggregating data into smaller values but instead building 
> the same large list you'd build with groupByKey). 

Great, thanks! I overlooked that. I guess it might even be better to use 
groupByKey if the aggregated list is very large for some keys?
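For illustration, the list-building combiner can be exercised in plain Python 
(no Spark needed; the values are made up) to see that reducing a key's values 
just rebuilds the full list that groupByKey would have materialized anyway:

```python
from functools import reduce

# The combiner discussed in this thread: wrap non-list arguments in a
# list, then concatenate.
combine = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

# Reducing all values of one key rebuilds the complete list -- the same
# large object groupByKey would have produced.
print(reduce(combine, ["California", "Colorado", "Texas"]))
# -> ['California', 'Colorado', 'Texas']

# A key with a single value stays a plain string, not a one-element list.
print(reduce(combine, ["Yorkshire"]))
# -> Yorkshire
```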


> If you look at rdd.py, you can see that both operations eventually use a 
> similar operation to do the actual work:
> 
> agg = Aggregator(createCombiner, mergeValue, mergeCombiners)
> 
> Best,
> -Sven
> 
> On Thu, Jun 25, 2015 at 4:34 PM, Kannappan Sirchabesan <buildka...@gmail.com> wrote:
> Thanks. This should work fine. 
> 
> I am trying to avoid groupByKey for performance reasons, as the input is a 
> giant RDD and the operation is associative, so there is minimal shuffle if 
> done via reduceByKey.
> 
>> On Jun 26, 2015, at 12:25 AM, Sven Krasser <kras...@gmail.com> wrote:
>> 
>> Hey Kannappan,
>> 
>> First of all, what is the reason for avoiding groupByKey since this is 
>> exactly what it is for? If you must use reduceByKey with a one-liner, then 
>> take a look at this:
>> 
>> lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
>> 
>> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element 
>> list but as a plain string (i.e., in the same way as in your output example).
>> 
>> Hope this helps!
>> -Sven
>> 
>> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan <buildka...@gmail.com> wrote:
>> Hi,
>>   I am trying to see what is the best way to reduce the values of an RDD of 
>> (key, value) pairs into (key, listOfValues) pairs. I know various ways of 
>> achieving this, but I am looking for an efficient, elegant one-liner if there 
>> is one.
>> 
>> Example:
>> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
>> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
>> 
>> Is it possible to use reduceByKey or foldByKey to achieve this, instead of 
>> groupByKey?
>> 
>> Something equivalent to the cons operator from LISP, so that I could just say 
>> reduceByKey(lambda x, y: (cons x y)). Maybe it is more a Python question 
>> than a Spark question: how to create a list from 2 elements without a 
>> starting empty list?
>> 
>> Thanks,
>> Kannappan
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
>> 
>> 
>> 
>> -- 
>> www.skrasser.com
> 
> 
> 
> 
> -- 
> www.skrasser.com



Re: reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
Thanks. This should work fine. 

I am trying to avoid groupByKey for performance reasons, as the input is a giant 
RDD and the operation is associative, so there is minimal shuffle if done via 
reduceByKey.
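As a quick sanity check in plain Python (no Spark involved; values are made 
up), the list-building combiner quoted below really is associative, which is 
what reduceByKey requires of its function:

```python
# Combiner of the same shape as the lambda discussed in this thread:
# promote each argument to a list, then concatenate.
combine = lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])

# Associativity: the grouping of applications must not change the result,
# or reduceByKey could produce different answers per partitioning.
left = combine(combine("California", "Colorado"), "Texas")
right = combine("California", combine("Colorado", "Texas"))
assert left == right == ["California", "Colorado", "Texas"]
```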

> On Jun 26, 2015, at 12:25 AM, Sven Krasser  wrote:
> 
> Hey Kannappan,
> 
> First of all, what is the reason for avoiding groupByKey since this is 
> exactly what it is for? If you must use reduceByKey with a one-liner, then 
> take a look at this:
> 
> lambda a, b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
> 
> In contrast to groupByKey, this won't return 'Yorkshire' as a one-element 
> list but as a plain string (i.e., in the same way as in your output example).
> 
> Hope this helps!
> -Sven
> 
> On Thu, Jun 25, 2015 at 3:37 PM, Kannappan Sirchabesan <buildka...@gmail.com> wrote:
> Hi,
>   I am trying to see what is the best way to reduce the values of an RDD of 
> (key, value) pairs into (key, listOfValues) pairs. I know various ways of 
> achieving this, but I am looking for an efficient, elegant one-liner if there 
> is one.
> 
> Example:
> Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
> Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)
> 
> Is it possible to use reduceByKey or foldByKey to achieve this, instead of 
> groupByKey?
> 
> Something equivalent to the cons operator from LISP, so that I could just say 
> reduceByKey(lambda x, y: (cons x y)). Maybe it is more a Python question 
> than a Spark question: how to create a list from 2 elements without a 
> starting empty list?
> 
> Thanks,
> Kannappan
> 
> 
> 
> 
> -- 
> www.skrasser.com



Re: Scala/Python or Java

2015-06-25 Thread Kannappan Sirchabesan
Hi,
  If you are new to all three languages, go with Scala or Python. Python is 
easier, but check out Scala and see if it is easy enough for you. With the 
launch of DataFrames, it might not even matter which language you choose, 
performance-wise.

Thanks,
Kannappan

> On Jun 25, 2015, at 10:02 PM, spark user  wrote:
> 
> Spark is based on Scala and is written in Scala. To debug and fix issues, I 
> guess learning Scala is good for the long term? Any advice?
> 
> 
> 
> On Thursday, June 25, 2015 1:26 PM, ayan guha  wrote:
> 
> 
> I am a Python fan, so I use Python. But I have noticed that some features are 
> typically 1-2 releases behind for Python. So I strongly agree with Ted: start 
> with the language you are most familiar with and plan to move to Scala 
> eventually.
> On 26 Jun 2015 06:07, "Ted Yu" wrote:
> The answer depends on the user's experience with these languages as well as 
> the most commonly used language in the production environment.
> 
> Learning Scala requires some time. If you're very comfortable with Java / 
> Python, you can go with that while at the same time familiarizing yourself 
> with Scala.
> 
> Cheers
> 
> On Thu, Jun 25, 2015 at 12:04 PM, spark user wrote:
> Hi All ,
> 
> I am new to Spark; I just want to know which language is good/best for 
> learning Spark?
> 
> 1) Scala 
> 2) Java 
> 3) Python 
> 
> I know Spark supports all 3 languages, but which one is best?
> 
> Thanks 
> su  



reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
Hi,
  I am trying to see what is the best way to reduce the values of an RDD of 
(key, value) pairs into (key, listOfValues) pairs. I know various ways of 
achieving this, but I am looking for an efficient, elegant one-liner if there is 
one.

Example:
Input RDD: (USA, California), (UK, Yorkshire), (USA, Colorado)
Output RDD: (USA, [California, Colorado]), (UK, Yorkshire)

Is it possible to use reduceByKey or foldByKey to achieve this, instead of 
groupByKey?

Something equivalent to the cons operator from LISP, so that I could just say 
reduceByKey(lambda x, y: (cons x y)). Maybe it is more a Python question than 
a Spark question: how to create a list from 2 elements without a starting 
empty list?
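Python has no built-in cons, but the closest one-liner promotes each argument 
to a list and concatenates. Sketched here in plain Python over the example 
input (an illustration only, not Spark code; in PySpark the same lambda would 
be the function passed to reduceByKey):

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

pairs = [("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")]

# Cons-like combiner: promote each side to a list, then concatenate.
combine = lambda x, y: (x if isinstance(x, list) else [x]) + (y if isinstance(y, list) else [y])

# Group by key (sorted first, since groupby only merges adjacent keys),
# then reduce each key's values with the combiner.
result = {key: reduce(combine, (v for _, v in group))
          for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))}
print(result)
# -> {'UK': 'Yorkshire', 'USA': ['California', 'Colorado']}
```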

Thanks,
Kannappan