Hey Kannappan,
First of all, what is the reason for avoiding groupByKey since this is
exactly what it is for? If you must use reduceByKey with a one-liner, then
take a look at this:
lambda a,b: (a if type(a) == list else [a]) + (b if type(b) == list else [b])
In contrast to groupByKey
Thanks. This should work fine.
I am trying to avoid groupByKey for performance reasons, as the input is a giant
RDD and the operation is associative, so there is minimal shuffle if it is done
via reduceByKey.
On Jun 26, 2015, at 12:25 AM, Sven Krasser kras...@gmail.com wrote:
Hey Kannappan
In that case the reduceByKey operation will likely not give you any benefit
(since you are not aggregating data into smaller values but instead
building the same large list you'd build with groupByKey). If you look at
rdd.py, you can see that both operations eventually use a similar operation
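For reference, a minimal Scala sketch of the alternatives (assuming a SparkContext sc as in spark-shell; the sample data is hypothetical). When the reduce function only concatenates lists, reduceByKey materializes the same per-key lists as groupByKey, so aggregateByKey mainly saves wrapping every value in a one-element list:
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// groupByKey: ships every value across the shuffle and builds the list on the reduce side.
val grouped = pairs.groupByKey().mapValues(_.toList)
// reduceByKey with list concatenation: the map-side combiner still builds the same lists.
val viaReduce = pairs.mapValues(List(_)).reduceByKey(_ ::: _)
// aggregateByKey: same result, without the per-record single-element list wrappers.
val viaAggregate = pairs.aggregateByKey(List.empty[Int])((acc, v) => v :: acc, _ ::: _)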
On Thu, Jun 25, 2015 at 5:01 PM, Kannappan Sirchabesan buildka...@gmail.com
wrote:
Great, thanks.
Hi Marchelo,
The data are collected in, say, class C; c.id is the id of each of those
records. But that id might appear more than once in those 1 mil XML files, so I
am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I
have to ++ those in order to correctly
Now I am profiling the executor.
There seems to be a memory leak.
20 mins after the run there were:
157k byte[] allocated for 75MB,
519k java.lang.ref.Finalizer for 31MB,
291k java.util.zip.Inflater for 17MB,
487k java.util.zip.ZStreamRef for 11MB.
An hour after the run I got:
186k byte[]
...and it keeps on increasing.
Maybe there is a bug in some code that zips/unzips data.
109k instances of byte[] followed by 1 mil instances of Finalizer, with
~500k Deflaters and ~500k Inflaters and 1 mil ZStreamRef
I assume that's due to either binaryFiles or saveAsObjectFile
On 11/06/15
After 2h of running, I now have 10GB of long[] (1.3 mil instances of long[]).
So probably information about the files again.
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with
the same contents are considered different values because their reference
values are different.
I didn't see any way to pass in a custom comparer. I could convert the byte[]
into a String with an explicit charset, but
I think if you wrap the byte[] into an object and implement equals and
hashcode methods, you may be able to do this. There will be the overhead of
extra object, but conceptually it should work unless I am missing
something.
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
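A minimal Scala sketch of that wrapping idea (the class name and usage are hypothetical); java.util.Arrays provides content-based equality and hashing for byte arrays:
import java.util.Arrays
// Wrapper so byte arrays compare by content when used as shuffle keys.
final class BytesKey(val bytes: Array[Byte]) extends Serializable {
  override def hashCode(): Int = Arrays.hashCode(bytes)
  override def equals(other: Any): Boolean = other match {
    case that: BytesKey => Arrays.equals(bytes, that.bytes)
    case _              => false
  }
}
// Usage sketch, with rdd a hypothetical RDD[(Array[Byte], Long)]:
// rdd.map { case (k, v) => (new BytesKey(k), v) }.reduceByKey(_ + _)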
between this and using `String` would be
similar enough to warrant just using `String`.
Mark
*From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
*Sent:* June-11-15 12:58 PM
*To:* Mark Tse
*Cc:* user@spark.apache.org
*Subject:* Re: ReduceByKey with a byte array as the key
).reduceByKey(reducer).saveAsObjectFile()
Initially I had it with groupBy but that method uses a lot of resources
(according to the javadocs). Switching to reduceByKey didn't have any
effect.
Seems like Spark goes into 2 cycles of calculations of ~270k items. In
the 1st round, around 15GB
I am profiling the driver. It currently has 564MB of strings, which might be
the 1 mil file names. But it also has 2.34 GB of long[]! That's so far; it
is still running. What are those long[] used for?
So, I don't have an explicit solution to your problem, but...
On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
kostas.koug...@googlemail.com wrote:
After some time the driver accumulated 6.67GB of long[]. The executor mem
usage so far is low.
+ chill-avro (mapping the Avro object to a simple Scala case class and
running reduce on these case class objects solves the problem)
Functions.ListStringToInt())
.reduceByKey(new SumList());
Functions implementation using Java 7. Java 8 should be simpler.
public static final class SumList implements
    Function2<List<Integer>, List<Integer>, List<Integer>> {
  @Override public List<Integer> call(List<Integer> l1, List<Integer> l2) throws
Hi,
I have a JavaPairRDD<String, String> and I want to apply the reduceByKey
method.
My pairRDD :
*2553: 0,0,0,1,0,0,0,0*
46551: 0,1,0,0,0,0,0,0
266: 0,1,0,0,0,0,0,0
*2553: 0,0,0,0,0,1,0,0*
*225546: 0,0,0,0,0,1,0,0*
*225546: 0,0,0,0,0,1,0,0*
I want to get :
*2553: 0,0,0,1,0,1,0,0*
46551
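A hedged Scala sketch of the transformation asked for above (the RDD name pairs is hypothetical, and it assumes every value is a comma-separated integer vector of the same length):
// ("2553", "0,0,0,1,0,0,0,0") ... -> ("2553", "0,0,0,1,0,1,0,0")
val summed = pairs                                               // hypothetical RDD[(String, String)]
  .mapValues(_.split(",").map(_.toInt))                          // parse "0,1,0" -> Array(0, 1, 0)
  .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })  // element-wise sum per key
  .mapValues(_.mkString(","))                                    // back to a comma-separated string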
Shoot me an email if you need any help with spark-sorted. It does not
(yet?) have a Java API, so you will have to work in Scala.
On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote:
I think this Spark Package may be what you're looking for!
oh wow, that is a really interesting observation, Marco and Jerry.
I wonder if this is worth exposing in combineByKey()? I think Jerry's
proposed workaround is all you can do for now -- use reflection to
side-step the fact that the methods you need are private.
On Mon, Apr 27, 2015 at 8:07 AM,
I think this Spark Package may be what you're looking for!
http://spark-packages.org/package/tresata/spark-sorted
Best,
Burak
of this aggregation is sorted. A way to do that can be something like
flatMapToPair(myMapFunc).reduceByKey(RangePartitioner, myReduceFunc).mapPartitions(i
-> sort(i)).
But I was thinking that the sorting phase can be pushed down to the
shuffle phase, as the same thing is done in sortByKey.
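For comparison, a hedged sketch of the route that stays on the public API (hypothetical RDD counts and partition count; it relies on an implicit Ordering of the key). It yields range-partitioned output sorted within each partition, but at the cost of a second shuffle after the reduce rather than pushing the ordering into the reduce shuffle itself:
import org.apache.spark.RangePartitioner
val reduced = counts.reduceByKey(_ + _)        // counts: hypothetical RDD[(String, Int)]
val partitioner = new RangePartitioner(8, reduced)
// Second shuffle: range-partition, then sort by key inside each partition.
val sorted = reduced.repartitionAndSortWithinPartitions(partitioner)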
Hi,
I'm trying, after reducing by key, to get data ordered among partitions
(like RangePartitioner) and within partitions (like sortByKey or
repartitionAndSortWithinPartitions), pushing the sorting down to the
shuffle machinery of the reduce phase.
I think, but maybe I'm wrong, that the correct
Hi Marco,
As far as I know, the current combineByKey() does not expose an argument
where you could set keyOrdering on the ShuffledRDD. Since ShuffledRDD is
package private, if you can get at the ShuffledRDD through reflection or some
other way, the keyOrdering you set will be pushed down to the shuffle. If you
Can you show us the function you passed to reduceByKey()?
What release of Spark are you using?
Cheers
On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke lovejay-lovemu...@outlook.com
wrote:
I'm trying to solve a Word-Count like problem, the difference lies in
that, I need the count of a specific
is represented with the Python datetime.datetime class. And I continued to
transform it to ((time, word_id), 1) and then used reduceByKey for the result.
But the dataset returned is a little weird, like the following:
format: ((timespan with datetime.datetime, wordid), freq)
((datetime.datetime(2009, 10, 6, 2, 0), 0), 8
in Python,
they're the same.
And as I said, I also tried using String as the key, which also failed.
From: so...@cloudera.com
Date: Sat, 18 Apr 2015 16:30:59 +0100
Subject: Re: Does reduceByKey only work properly for numeric keys?
To: lovejay-lovemu...@outlook.com
CC: user@spark.apache.org
Do
. The operating
system is Arch Linux.
And there is one node running x86 Arch Linux, while the others run x86_64.
The problem arises as long as the x86 node and the x64 nodes work together.
Nothing is wrong if there is only an x86 node in the cluster, or just x64 nodes.
And currently only reduceByKey with int32 keys makes sense.
the hash value from
converted to long. And it seems there is no duplication.
(All conclusions above came from simple sampling of the returned dataset and
are not well validated.)
From: lovejay-lovemu...@outlook.com
To: so...@cloudera.com
CC: user@spark.apache.org
Subject: RE: Does reduceByKey only work
Maybe
Hi Imran,
I will say your explanation is extremely helpful :)
I tested some ideas according to your explanation and it makes perfect sense to
me. I modified my code to use cogroup+mapValues instead of union+reduceByKey to
preserve the partitioning, which gives me more than 100% performance gain
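A hedged sketch of that change (RDD names, value type, and partition count are hypothetical; it assumes both inputs are partitioned with the same partitioner, which cogroup can then reuse and mapValues preserves):
import org.apache.spark.HashPartitioner
val p = new HashPartitioner(8)
val a = rawA.partitionBy(p)   // rawA, rawB: hypothetical RDD[(String, Long)]
val b = rawB.partitionBy(p)
// (a union b).reduceByKey(_ + _) would reshuffle; cogroup on co-partitioned inputs can avoid that.
val merged = a.cogroup(b).mapValues { case (xs, ys) => xs.sum + ys.sum }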
Hi Sathish,
The current implementation of countByKey uses reduceByKey:
https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332
It seems that countByKey is mostly deprecated:
https://issues.apache.org/jira/browse/SPARK-3994
-Jey
On Tue
Hello,
Quick question. I am trying to understand the difference between reduceByKey
and countByKey. Which one gives better performance, reduceByKey or countByKey?
While we can perform the same count operation using reduceByKey, why do we need
countByKey/countByValue?
Thanks
Sathish
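A short Scala sketch of the practical difference (assuming a SparkContext sc; the sample data is hypothetical). countByKey returns a Map on the driver, so it only suits a modest number of distinct keys, while the reduceByKey form keeps the counts distributed:
val events = sc.parallelize(Seq(("a", "x"), ("a", "y"), ("b", "z")))
// Collected to the driver as a Map[String, Long]:
val onDriver = events.countByKey()                        // Map(a -> 2, b -> 1)
// Stays distributed as an RDD[(String, Long)], with a map-side combine:
val asRdd = events.mapValues(_ => 1L).reduceByKey(_ + _)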
enough put in memory), how can I access the other RDD's local partition in the
mapPartitions method? Is there any way to do this in Spark?
From: Shao, Saisai [mailto:saisai.s...@intel.com]
Sent: Monday, February 23, 2015 3:13 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey
If you call reduceByKey(), internally Spark will introduce a shuffle
operation, no matter whether the data is already partitioned locally; Spark itself
does not know the data is already well partitioned.
So if you want to avoid a shuffle, you have to write the code explicitly to
avoid this, from my
))
}
ranks = contributions.union(leakedMatrix).reduceByKey(_ + _)
}
ranks.lookup(1)
In the above code, links will join ranks and should preserve the partitioning, and
leakedMatrix also shares the same partitioning, so I expect there is no shuffle
happening on the contributions.union(leakedMatrix
: Monday, February 23, 2015 3:13 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey will trigger shuffle even same partition?
I think some RDD APIs like zipPartitions or others can do this as you wanted. I
might check the docs.
Thanks
Jerry
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, February 23, 2015 1:35 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: RE: Union and reduceByKey will trigger
I've no context on this book; AFAIK union will not trigger a shuffle, as it
just puts the partitions together, while the operator reduceByKey() will actually
trigger a shuffle.
Thanks
Jerry
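A hedged sketch of writing that explicitly (RDD names and partition count are hypothetical): pre-partition both inputs with one partitioner and pass the same partitioner to reduceByKey, then check rdd.partitioner and the stage DAG in the web UI to confirm whether a shuffle was actually introduced:
import org.apache.spark.HashPartitioner
val p = new HashPartitioner(8)
val contributions = rawContributions.partitionBy(p)   // hypothetical RDD[(Long, Double)]
val leakedMatrix  = rawLeakedMatrix.partitionBy(p)    // hypothetical RDD[(Long, Double)]
// Passing the partitioner explicitly makes the intended partitioning visible to Spark.
val ranks = contributions.union(leakedMatrix).reduceByKey(p, _ + _)
println(ranks.partitioner)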
From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: Monday, February 23, 2015 12:26 PM
To: Shao, Saisai
Cc
, which is not really what is always causing the
shuffle. E.g., reduceByKey() causes a shuffle, but we don't see that in a
stage name. Similarly, map()
does not cause a shuffle, but we see a stage with that name.
So, what do the stage boundaries we see actually correspond to?
1) map -- that is doing
Hi,
I have a stream pipeline which invokes map, reduceByKey, filter, and
flatMap. How can I measure the time taken in each stage?
Thanks,
Josh
I believe from the web UI (running on port 8080) you will get these
measurements.
Thanks
Best Regards
On Sat, Jan 31, 2015 at 9:29 AM, Josh J joshjd...@gmail.com wrote:
I'm trying to process a large dataset; mapping/filtering works OK, but
as soon as I try to reduceByKey, I get out-of-memory errors:
http://pastebin.com/70M5d0Bn
Any ideas how I can fix that?
Thanks.
equally to all the tasks?
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b).repartition(2)
    .saveAsTextFile("hdfs://localhost:9000/user/praveen/output/")
Thanks,
Praveen
--
- Rishi
(...); // the group-function is just creating a hashcode
from some fields in MyClass.
Now I want to reduce the variable grouped. However, I want to reduce it with
different functions depending on the key in the JavaPairRDD. So basically a
reduceByKey but with multiple functions.
Only solution I've come up
groupByKey does not run a combiner, so be careful about the
performance... groupByKey does a shuffle even for local groups...
reduceByKey and aggregateByKey do run a combiner, but if you want a
separate function for each key, you can have a key-to-closure map that you
can broadcast and use
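A hedged Scala sketch of that broadcast-a-map-of-closures idea (key and value types, the function map, and the default are all hypothetical; each per-key function still has to be associative and commutative):
val fnByKey: Map[String, (Int, Int) => Int] = Map(
  "sum" -> ((a: Int, b: Int) => a + b),
  "max" -> ((a: Int, b: Int) => math.max(a, b))
)
val bcFns     = sc.broadcast(fnByKey)          // assumes a SparkContext sc
val defaultFn = (a: Int, b: Int) => a + b
val reduced = pairs                            // pairs: hypothetical RDD[(String, Int)]
  .map { case (k, v) => (k, (k, v)) }          // carry the key inside the value
  .reduceByKey { case ((k, a), (_, b)) =>
    (k, bcFns.value.getOrElse(k, defaultFn)(a, b))   // dispatch on the key
  }
  .mapValues(_._2)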
, Tuple2<K, V>>(t._1(), new Tuple2<K, V>(t._1(), t._2()));
}
});
}
I am missing out. I tried changing the
serializer but am still getting a similar error.
I placed the error here: http://pastebin.com/0tqiiJQm
) File "/home/sid/Downloads/spark/python/pyspark/serializers.py", line 464, in read_int
    raise EOFError
EOFError
I have implemented the advice and now the issue is resolved.
it could.
Is ReduceWords actually an inner class?
Or on another tangent, when you remove reduceByKey, are you also
removing print? That would cause it to do nothing, which of course
generates no error.
On Wed, Oct 15, 2014 at 12:11 AM, Abraham Jacob abe.jac...@gmail.com wrote:
Hi All,
I am
the reduceByKey) and
reading the log file led me to believe that there was something wrong with
the way I implemented the reduceByKey. In fact there was nothing wrong
with the reduceByKey implementation. Just for closure (no pun intended), I
will try and explain what happened. Maybe it will help someone else.
there for almost 1 hour.
I guess I can only go with random initialization in KMeans.
Thanks again for your help.
Ray
observable
hanging.
Hopefully this provides more information.
Thanks.
Ray
Hi Ray,
The reduceByKey / collectAsMap does a lot of calculations. Therefore it can
take a very long time if:
1) The parameter number of runs is set very high
2) k is set high (you have observed this already)
3) data is not properly repartitioned
It seems that it is hanging, but there is a lot
be an active stage with an incomplete progress bar in
the UI. Am I wrong?
Thanks, Burak!
Ray
...
The moment I introduce the reduceByKey, I start getting the following
error and Spark Streaming shuts down -
14/10/14 17:58:45 ERROR JobScheduler: Error running job streaming job
1413323925000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable
I saw a similar bottleneck in the reduceByKey operation. Maybe we can
implement treeReduceByKey to reduce the pressure on the single executor
reducing a particular key.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https
the input data? If yes, could you check
the storage tab of Spark WebUI and see how the data is distributed
across executors. -Xiangrui
On Tue, Oct 14, 2014 at 4:26 PM, DB Tsai dbt...@dbtsai.com wrote:
ReduceWords());
wordCount.print();
jssc.start();
jssc.awaitTermination();
return 0;
If I remove the code (highlighted) JavaPairDStream<String, Integer>
wordCount = wordMap.reduceByKey(new ReduceWords());, the application works
just fine...
The moment I introduce the reduceByKey, I start getting
in 11
iterations.
14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913
seconds.*
14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at
KMeans.scala:190
14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at
KMeans.scala:190)
14/10/14 15:54:37 INFO
to it
after I upgrade to Spark 1.1.0.
-Kris
aggregation:
class Record(val PrimaryId: Int,
             val SubId: Int,
             var Event1Count: Int,
             var Event2Count: Int) extends Serializable {
}
Then once I have an RDD I do a reduce by key:
val allAgr = all.map(x => (s"${x.PrimaryId}-${x.SubId}", x)).reduceByKey { (l, r) =>
  l.Event1Count = l.Event1Count + r.Event1Count
  l.Event2Count = l.Event2Count + r.Event2Count
  l
}.map(x => x._2)
The problem
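Independent of what the root cause turned out to be in this thread, a hedged sketch of the more defensive pattern: reduce over immutable values and build a new result rather than mutating either argument inside reduceByKey (the Counts case class is hypothetical):
case class Counts(event1: Int, event2: Int)
val allAgr = all                               // `all` is the RDD[Record] from above
  .map(x => (s"${x.PrimaryId}-${x.SubId}", Counts(x.Event1Count, x.Event2Count)))
  .reduceByKey((l, r) => Counts(l.event1 + r.event1, l.event2 + r.event2))
  .map(_._2)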
Hello,
I am facing performance issues with reduceByKey. I know that this topic
has already been covered, but I did not really find answers to my question.
I am using reduceByKey to remove entries with identical keys, using, as the
reduce function, (a, b) => a. It seems to be a relatively
If you are just looking for distinct keys, .keys.distinct() should be
much better.
On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme julien.ca...@gmail.com wrote:
]
x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l }
Does that make sense?
On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme julien.ca...@gmail.com
wrote:
I need to remove objects with duplicate keys, but I need the whole object.
Objects which have the same key are not necessarily
too.
On Sat, Sep 13, 2014 at 1:15 PM, Gary Malouf malouf.g...@gmail.com
wrote:
You need something like:
val x: RDD[MyAwesomeObject]
x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l }
Does that make sense?
On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme julien.ca
Hello Spark aficionados,
We upgraded from Spark 1.0.0 to 1.0.1 when the new release came out and
started noticing some weird errors. Even a simple operation like
reduceByKey or count on an RDD gets stuck in cluster mode. This issue
does not occur with Spark 1.0.0 (in cluster or local mode
Hi there,
I'm interested if it is possible to get the same behavior as for reduce
function from MR framework. I mean for each key K get list of associated
values List<V>.
There is the function reduceByKey, which works only with a separate V from the list.
Does there exist any way to get the list? Because I have to sort it in a particular
way and apply some business logic.
Thank you in advance,
Konstantin Kudryavtsev
A long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about
performance
(http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/)
that reduceByKey will be more efficient than groupByKey... he mentioned
groupByKey copies all data over the network.
The point is that in many cases the operation passed to reduceByKey aggregates
data into a much smaller size, say + and * for integers. String concatenation
doesn’t actually “shrink” data, thus in your case rdd.reduceByKey(_ ++ _) and
rdd.groupByKey suffer from a similar performance issue. In general
Specifically, reduceByKey expects a commutative/associative reduce
operation, and will automatically do this locally before a shuffle, which
means it acts like a combiner in MapReduce terms -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
On Thu
-- v4
v2 -- v6
v4 -- v2
v4 -- v6
v6 -- v2
v6 -- v4
Does anyone have advice on what will be the best way to do that over a
graph instance?
I attempted to do it using mapReduceTriplets but I need the reduce function
to work like reduceByKey, which I wasn't able to do.
Thank you.
-- Omer
Hi,
I am a complete newbie to spark and map reduce frameworks and have a basic
question on the reduce function. I was working on the word count example
and was stuck at the reduce stage where the sum happens.
I am trying to understand the working of reduceByKey in Spark using Java
as the programming language.
Say if I have a sentence "I am who I am", I break the sentence into words and
store it as a list: [I, am, who, I, am].
Now this function assigns 1 to each word:
JavaPairRDD ones = words.mapToPair(new PairFunction
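To make the reduce step concrete, a small Scala sketch of the same word count (assuming a SparkContext sc as in spark-shell). reduceByKey merges the two (I, 1) pairs and the two (am, 1) pairs, partly on the map side before the shuffle:
val words  = sc.parallelize(Seq("I", "am", "who", "I", "am"))
val ones   = words.map(w => (w, 1))            // (I,1), (am,1), (who,1), (I,1), (am,1)
val counts = ones.reduceByKey(_ + _)           // (I,2), (am,2), (who,1)
counts.collect().foreach(println)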
Hi All,
I was able to resolve this matter with a simple fix. It seems that in order
to process a reduceByKey and the flat map operations at the same time, the
only way to resolve was to increase the number of threads to 1.
Since I'm developing on my personal machine for speed, I simply updated
Hi all, I recently just picked up Spark and am trying to work through a
coding issue that involves the reduceByKey method. After various debugging
efforts, it seems that the reduceByKey method never gets called.
Here's my workflow, which is followed by my code and results:
My parsed data
I guess this is a basic question about the usage of reduce. Please shed some
light, thank you!
this is a basic question about the
reduce function.
I will appreciate any help, thank you!
I have a simple Java class as follows, that I want to use as a key while
applying groupByKey or reduceByKey functions:
private static class FlowId {
    public String dcxId;
    public String trxId;
    public String msgType;
    public FlowId(String dcxId, String trxId, String msgType
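As with the byte[] thread earlier, the key class needs value-based equals() and hashCode() so that groupByKey/reduceByKey treat two equal flows as the same key. For reference, a Scala case class gives both for free (a hypothetical sketch mirroring the fields above):
// Structural equals/hashCode are generated automatically for case classes.
case class FlowId(dcxId: String, trxId: String, msgType: String)
// Usage sketch, with flows a hypothetical RDD[(FlowId, Long)]:
// flows.reduceByKey(_ + _)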