Re: reduceByKey - add values to a list

2015-06-25 Thread Sven Krasser
Hey Kannappan, First of all, what is the reason for avoiding groupByKey since this is exactly what it is for? If you must use reduceByKey with a one-liner, then take a look at this: lambda a,b: (a if type(a) == list else [a]) + (b if type(b) == list else [b]) In contrast to groupByKey
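
The one-liner above is PySpark; below is a minimal Scala sketch of the same idea, assuming a local SparkContext and toy data. It wraps each value in a one-element list and concatenates in reduceByKey, which produces the same result as groupByKey but still moves the full lists through the shuffle, which is the point made in the follow-up reply.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("list-by-key").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// groupByKey: the direct way to collect all values per key
val grouped = pairs.groupByKey().mapValues(_.toList)

// reduceByKey equivalent: wrap each value in a one-element list, then concatenate
val reduced = pairs.mapValues(v => List(v)).reduceByKey(_ ++ _)

println(grouped.collect().toMap)   // Map(a -> List(1, 2), b -> List(3))
println(reduced.collect().toMap)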

Re: reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
Thanks. This should work fine. I am trying to avoid groupByKey for performance reasons, as the input is a giant RDD and the operation is an associative operation, so there is minimal shuffle if done via reduceByKey. On Jun 26, 2015, at 12:25 AM, Sven Krasser kras...@gmail.com wrote: Hey Kannappan

Re: reduceByKey - add values to a list

2015-06-25 Thread Sven Krasser
In that case the reduceByKey operation will likely not give you any benefit (since you are not aggregating data into smaller values but instead building the same large list you'd build with groupByKey). If you look at rdd.py, you can see that both operations eventually use a similar operation

Re: reduceByKey - add values to a list

2015-06-25 Thread Sven Krasser
On Thu, Jun 25, 2015 at 5:01 PM, Kannappan Sirchabesan buildka...@gmail.com wrote: On Jun 26, 2015, at 12:46 AM, Sven Krasser kras...@gmail.com wrote: In that case the reduceByKey operation will likely not give you any benefit (since you are not aggregating data into smaller values

Re: reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
On Jun 26, 2015, at 12:46 AM, Sven Krasser kras...@gmail.com wrote: In that case the reduceByKey operation will likely not give you any benefit (since you are not aggregating data into smaller values but instead building the same large list you'd build with groupByKey). great. thanks

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Hi Marcelo, The data are collected in, say, class C; c.id is the id of each of those records. But that id might appear more than once in those 1mil xml files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I have to ++ those in order to correctly

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Now I am profiling the executor. There seems to be a memory leak. 20 mins after the run there were 157k byte[] allocated for 75MB, 519k java.lang.ref.Finalizer for 31MB, 291k java.util.zip.Inflater for 17MB, and 487k java.util.zip.ZStreamRef for 11MB. An hour after the run I got: 186k byte[]

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
..and keeps on increasing. Maybe there is a bug in some code that zips/unzips data: 109k instances of byte[] followed by 1 mil instances of Finalizer, with ~500k Deflaters, ~500k Inflaters, and 1 mil ZStreamRef. I assume that's due to either binaryFiles or saveAsObjectFile. On 11/06/15

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
after 2h of running, I now have 10GB of long[] (1.3mil instances), so probably information about the files again.

ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are considered different values because their reference values are different. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but

Re: ReduceByKey with a byte array as the key

2015-06-11 Thread Sonal Goyal
I think if you wrap the byte[] into an object and implement equals and hashcode methods, you may be able to do this. There will be the overhead of extra object, but conceptually it should work unless I am missing something. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co
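
A minimal sketch of that wrapper idea in Scala (the thread is Java, but the shape is the same): value-based equals and hashCode over the byte contents, then use the wrapper as the key. The RDD name is illustrative.

import java.util.Arrays

class BytesKey(val bytes: Array[Byte]) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: BytesKey => Arrays.equals(this.bytes, that.bytes)
    case _              => false
  }
  override def hashCode(): Int = Arrays.hashCode(bytes)
}

// usage sketch, assuming rdd: RDD[(Array[Byte], Long)]
// rdd.map { case (k, v) => (new BytesKey(k), v) }.reduceByKey(_ + _)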

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
Subject: Re: ReduceByKey with a byte array as the key I think if you wrap the byte[] into an object and implement equals and hashcode methods, you may be able to do this. There will be the overhead of extra object, but conceptually it should work unless I am missing something. Best Regards, Sonal

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Aaron Davidson
between this and using `String` would be similar enough to warrant just using `String`. Mark *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com] *Sent:* June-11-15 12:58 PM *To:* Mark Tse *Cc:* user@spark.apache.org *Subject:* Re: ReduceByKey with a byte array as the key I think if you

spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
).reduceByKey( reducer ).saveAsObjectFile() Initially I had it with groupBy, but that method uses a lot of resources (according to the javadocs). Switching to reduceByKey didn't have any effect. Spark seems to go through 2 cycles of calculations of ~270k items each. In the 1st round, around 15GB
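
For reference, a rough sketch of the kind of pipeline described in this thread; parseXml and mergeRecords are hypothetical stand-ins for the poster's actual code, and the paths are made up.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("binary-files-reduce"))

def parseXml(bytes: Array[Byte]): (String, Seq[String]) = ???   // hypothetical parser -> (id, record)
def mergeRecords(a: Seq[String], b: Seq[String]): Seq[String] = a ++ b

// Each element is (path, PortableDataStream); with ~1M small files the driver
// alone has to track a very large number of file statuses.
val files = sc.binaryFiles("hdfs:///data/xml", minPartitions = 1000)

files.map { case (_, stream) => parseXml(stream.toArray()) }
  .reduceByKey(mergeRecords _)
  .saveAsObjectFile("hdfs:///data/out")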

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
I am profiling the driver. It currently has 564MB of strings, which might be the 1mil file names. But it also has 2.34 GB of long[]! That's so far; it is still running. What are those long[] used for?

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Marcelo Vanzin
So, I don't have an explicit solution to your problem, but... On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios kostas.koug...@googlemail.com wrote: I am profiling the driver. It currently has 564MB of strings which might be the 1mil file names. But also it has 2.34 GB of long[] ! That's so

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
After some time the driver accumulated 6.67GB of long[]. The executor mem usage so far is low.

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
in context: http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Igor Berman
and running reduce on these case class objects solves the problem) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Igor Berman
+ chill-avro(mapping avro object to simple scala case class and running reduce on these case class objects solves the problem) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html Sent from the Apache Spark
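
A sketch of the workaround being described: map the Avro-generated records to plain case classes before the shuffle, so equality, hashCode, and serialization behave predictably inside reduceByKey. The record fields and getters here are illustrative, not from the thread.

case class EventKey(userId: String, day: String)
case class EventValue(count: Long)

// avroRecords: RDD[SomeAvroRecord] -- hypothetical input with hypothetical getters
val pairs = avroRecords.map(r => (EventKey(r.getUserId.toString, r.getDay.toString), EventValue(r.getCount)))
val reduced = pairs.reduceByKey((a, b) => EventValue(a.count + b.count))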

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
class and running reduce on these case class objects solves the problem) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread Igor Berman
on these case class objects solves the problem) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/union-and-reduceByKey-wrong-shuffle-tp23092p23093.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: reduceByKey

2015-05-14 Thread Gaspar Muñoz
Functions.ListStringToInt()) .reduceByKey(new SumList()); Functions implementation using Java 7; Java 8 should be simpler. public static final class SumList implements Function2<List<Integer>, List<Integer>, List<Integer>> { @Override public List<Integer> call(List<Integer> l1, List<Integer> l2) throws

reduceByKey

2015-05-14 Thread Yasemin Kaya
Hi, I have a JavaPairRDD<String, String> and I want to implement the reduceByKey method. My pairRDD: *2553: 0,0,0,1,0,0,0,0* 46551: 0,1,0,0,0,0,0,0 266: 0,1,0,0,0,0,0,0 *2553: 0,0,0,0,0,1,0,0* *225546: 0,0,0,0,0,1,0,0* *225546: 0,0,0,0,0,1,0,0* I want to get : *2553: 0,0,0,1,0,1,0,0* 46551
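
A sketch of the element-wise sum being asked for, written in Scala rather than Java: parse each value into numbers, add the vectors position by position in reduceByKey, then join them back into a string. Assumes a SparkContext sc is in scope.

val lines = sc.parallelize(Seq(
  ("2553", "0,0,0,1,0,0,0,0"), ("2553", "0,0,0,0,0,1,0,0"),
  ("225546", "0,0,0,0,0,1,0,0"), ("225546", "0,0,0,0,0,1,0,0")))

val summed = lines
  .mapValues(_.split(",").map(_.toInt))                         // (key, Array[Int])
  .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y }) // element-wise add
  .mapValues(_.mkString(","))                                   // back to "0,0,0,1,0,1,0,0"

summed.collect().foreach(println)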

Re: reduceByKey

2015-05-14 Thread ayan guha
Kaya godo...@gmail.com wrote: Hi, I have JavaPairRDD<String, String> and I want to implement reduceByKey method. My pairRDD : *2553: 0,0,0,1,0,0,0,0* 46551: 0,1,0,0,0,0,0,0 266: 0,1,0,0,0,0,0,0 *2553: 0,0,0,0,0,1,0,0* *225546: 0,0,0,0,0,1,0,0* *225546: 0,0,0,0,0,1,0,0* I want to get

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Koert Kuipers
Shoot me an email if you need any help with spark-sorted. It does not (yet?) have a Java API, so you will have to work in Scala. On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote: I think this Spark Package may be what you're looking for!

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Imran Rashid
Oh wow, that is a really interesting observation, Marco and Jerry. I wonder if this is worth exposing in combineByKey()? I think Jerry's proposed workaround is all you can do for now -- use reflection to side-step the fact that the methods you need are private. On Mon, Apr 27, 2015 at 8:07 AM,

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for! http://spark-packages.org/package/tresata/spark-sorted Best, Burak On Mon, May 4, 2015 at 12:56 PM, Imran Rashid iras...@cloudera.com wrote: oh wow, that is a really interesting observation, Marco Jerry. I wonder if this is worth

Re: ReduceByKey and sorting within partitions

2015-04-29 Thread Marco
of this aggregation is sorted. A way to do that can be something like flatMapToPair(myMapFunc).reduceByKey(RangePartitioner, myReduceFunc).mapPartitions(i -> sort(i)). But I was thinking that the sorting phase can be pushed down to the shuffle phase, as the same thing is done in sortByKey

ReduceByKey and sorting within partitions

2015-04-27 Thread Marco
Hi, I'm trying, after reducing by key, to get data ordered among partitions (like RangePartitioner) and within partitions (like sortByKey or repartitionAndSortWithinPartitions), pushing the sorting down to the shuffle machinery of the reducing phase. I think, but maybe I'm wrong, that the correct
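
Using only public API, one way to get close to this is to reduce first and then let a second shuffle do the sorting via repartitionAndSortWithinPartitions. That is two shuffles rather than the single fused one being asked about, so this is a sketch of a workaround, not an answer to the question; the data and partition counts are illustrative.

import org.apache.spark.RangePartitioner

val pairs   = sc.parallelize(Seq((3, 1), (1, 2), (3, 4), (2, 5)))
val reduced = pairs.reduceByKey(_ + _)
val part    = new RangePartitioner(4, reduced)                  // range-partitioned like sortByKey
val sorted  = reduced.repartitionAndSortWithinPartitions(part)  // sorting happens during this shuffle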

Re: ReduceByKey and sorting within partitions

2015-04-27 Thread Saisai Shao
Hi Marco, As far as I know, the current combineByKey() does not expose an argument for setting keyOrdering on the ShuffledRDD, since ShuffledRDD is package private. If you can get at the ShuffledRDD through reflection or some other way, the keyOrdering you set will be pushed down to the shuffle. If you

RE: ReduceByKey and sorting within partitions

2015-04-27 Thread Ganelin, Ilya
Standard Time To: user@spark.apache.org Subject: ReduceByKey and sorting within partitions Hi, I'm trying, after reducing by key, to get data ordered among partitions (like RangePartitioner) and within partitions (like sortByKey or repartitionAndSortWithinPartition) pushing the sorting down

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Ted Yu
Can you show us the function you passed to reduceByKey() ? What release of Spark are you using ? Cheers On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke lovejay-lovemu...@outlook.com wrote: I'm trying to solve a Word-Count like problem, the difference lies in that, I need the count of a specific

Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
is represented with Python datetime.datetime class. And I continued to transform it to ((time, word_id), 1) then use reduceByKey for result. But the dataset returned is a little weird, just like the following: format:((timespan with datetime.datetime, wordid), freq) ((datetime.datetime(2009, 10, 6, 2

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Sean Owen
datetime.datetime class. And I continued to transform it to ((time, word_id), 1) then use reduceByKey for result. But the dataset returned is a little weird, just like the following: format: ((timespan with datetime.datetime, wordid), freq) ((datetime.datetime(2009, 10, 6, 2, 0), 0), 8

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
in Python, they're the same. And as I said, I also tried using String as the key, which also failed. From: so...@cloudera.com Date: Sat, 18 Apr 2015 16:30:59 +0100 Subject: Re: Does reduceByKey only work properly for numeric keys? To: lovejay-lovemu...@outlook.com CC: user@spark.apache.org Do

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
The operating system is Arch Linux. There is one node running x86 Arch Linux, while the others run x86_64. The problem arises as long as the x86 node and the x64 nodes work together. Nothing goes wrong if there is only the x86 node in the cluster, or just x64 nodes. And currently only reduceByKey with int32

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
the hash value from converted to long. And it seems there is no duplication. (All conclusions above came from simple sampling of the returned dataset and are not well validated.) From: lovejay-lovemu...@outlook.com To: so...@cloudera.com CC: user@spark.apache.org Subject: RE: Does reduceByKey only work

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Ted Yu
is ArchLinux. And, there is a node, running x86 Arch Linux, while the others x86_64. The problem arises, as long as the x86 node and x64 nodes works together. Nothing wrong if there is only a x86 node in the cluster, or just x64 nodes. And currently only reduceByKey with int32 keys makes sense. Maybe

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-24 Thread Shuai Zheng
Hi Imran, I will say your explanation is extremely helpful. I tested some ideas according to your explanation and it makes perfect sense to me. I modified my code to use cogroup+mapValues instead of union+reduceByKey to preserve the partitioning, which gives me more than 100% performance gain
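
A sketch of the cogroup+mapValues pattern mentioned here, with made-up value types: if both inputs already share the same partitioner, cogroup can avoid another shuffle, and mapValues keeps the partitioner intact (unlike map).

import org.apache.spark.HashPartitioner

val part  = new HashPartitioner(8)
// contributions, leakedMatrix: RDD[(Long, Double)] -- hypothetical inputs
val left  = contributions.partitionBy(part).cache()
val right = leakedMatrix.partitionBy(part).cache()

val ranks = left.cogroup(right).mapValues { case (a, b) => (a ++ b).sum }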

Re: reduceByKey vs countByKey

2015-02-24 Thread Jey Kottalam
Hi Sathish, The current implementation of countByKey uses reduceByKey: https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332 It seems that countByKey is mostly deprecated: https://issues.apache.org/jira/browse/SPARK-3994 -Jey On Tue

reduceByKey vs countByKey

2015-02-24 Thread Sathish Kumaran Vairavelu
Hello, quick question. I am trying to understand the difference between reduceByKey and countByKey. Which one gives better performance, reduceByKey or countByKey? While we can perform the same count operation using reduceByKey, why do we need countByKey/countByValue? Thanks Sathish
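
The practical difference in a small sketch (pairs is a hypothetical RDD[(String, Int)]): countByKey returns a Map to the driver, so it only suits a modest number of distinct keys, while the reduceByKey version keeps the counts as a distributed RDD.

val counted: scala.collection.Map[String, Long] = pairs.countByKey()   // collected to the driver
val reduced = pairs.mapValues(_ => 1L).reduceByKey(_ + _)              // stays an RDD[(String, Long)]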

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
enough to put in memory), how can I access the other RDD's local partition in the mapPartitions method? Is there any way to do this in Spark? From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent: Monday, February 23, 2015 3:13 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: RE: Union and reduceByKey

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
If you call reduceByKey(), internally Spark will introduce a shuffle operation, no matter whether the data is already partitioned locally; Spark itself does not know the data is already well partitioned. So if you want to avoid the shuffle, you have to write the code explicitly to avoid this, from my

Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
)) } ranks = contributions.union(leakedMatrix).reduceByKey(_ + _) } ranks.lookup(1) In the above code, links will join ranks and should preserve the partitioning, and leakedMatrix also shares the same partitioning, so I expect no shuffle to happen on the contributions.union(leakedMatrix

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shuai Zheng
: Monday, February 23, 2015 3:13 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: RE: Union and reduceByKey will trigger shuffle even same partition? If you call reduceByKey(), internally Spark will introduce a shuffle operations, not matter the data is already partitioned locally, Spark

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I think some RDD APIs like zipPartitions or others can do this as you wanted. I might check the docs. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 1:35 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: RE: Union and reduceByKey will trigger

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Shao, Saisai
I have no context on this book; AFAIK union will not trigger a shuffle, as it just puts the partitions together, while the operator reduceByKey() will actually trigger a shuffle. Thanks Jerry From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Monday, February 23, 2015 12:26 PM To: Shao, Saisai Cc

Re: Union and reduceByKey will trigger shuffle even same partition?

2015-02-23 Thread Imran Rashid
, which is not really what is always causing the shuffle. E.g., reduceByKey() causes a shuffle, but we don't see that in a stage name. Similarly, map() does not cause a shuffle, but we see a stage with that name. So, what do the stage boundaries we see actually correspond to? 1) map -- that is doing

measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each stage? Thanks, Josh

Re: measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Akhil Das
I believe from the web UI (running on port 8080) you will get these measurements. Thanks Best Regards On Sat, Jan 31, 2015 at 9:29 AM, Josh J joshjd...@gmail.com wrote: Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each
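
Besides the web UI, per-stage wall-clock time can be logged programmatically with a SparkListener; a minimal sketch (stage-level timing, not per-operator, and assuming a SparkContext sc is in scope):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info   = stage.stageInfo
    val millis = for (s <- info.submissionTime; e <- info.completionTime) yield e - s
    println(s"Stage ${info.stageId} (${info.name}) took ${millis.getOrElse(-1L)} ms")
  }
})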

Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset; mapping/filtering works ok, but as soon as I try to reduceByKey, I get out of memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks.
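
Without seeing the job it is hard to say more, but two common first steps for heap errors in a reduceByKey stage are sketched below: raise the number of reduce partitions so each task handles less data, and give the executors more memory. The numbers and RDD name are illustrative.

// records: RDD[(String, Long)] -- hypothetical
val reduced = records.reduceByKey(_ + _, 400)   // 400 reduce partitions instead of the default

// and/or on submit:  --executor-memory 4g  --conf spark.default.parallelism=400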

Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
works ok, but as long as I try to reduceByKey, I get out of memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks.

Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
equally to all the tasks? textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b).repartition(2).saveAsTextFile("hdfs://localhost:9000/user/praveen/output/") Thanks, Praveen -- - Rishi

ReduceByKey but with different functions depending on key

2014-11-18 Thread jelgh
(...); // the group-function is just creating a hashcode from some fields in MyClass. Now I want to reduce the variable grouped. However, I want to reduce it with different functions depending on the key in the JavaPairRDD. So basically a reduceByKey but with multiple functions. Only solution I've come up

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Yanbo
is just creating a hashcode from some fields in MyClass. Now I want to reduce the variable grouped. However, I want to reduce it with different functions depending on the key in the JavaPairRDD. So basically a reduceByKey but with multiple functions. Only solution I've come up

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner, so be careful about the performance... groupByKey does shuffle even for local groups... reduceByKey and aggregateByKey do run a combiner, but if you want a separate function for each key, you can have a key-to-closure map that you can broadcast and use
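
A sketch of that broadcast key-to-closure idea; the keys and functions are made up. Because reduceByKey's function never sees the key, the key is carried inside the value here.

val fnByKey: Map[String, (Int, Int) => Int] = Map(
  "clicks"  -> ((a, b) => a + b),           // sum these
  "latency" -> ((a, b) => math.max(a, b)))  // keep the max of these
val fns = sc.broadcast(fnByKey)

// pairs: RDD[(String, Int)] -- hypothetical
val reduced = pairs
  .map { case (k, v) => (k, (k, v)) }                          // carry the key in the value
  .reduceByKey((x, y) => (x._1, fns.value(x._1)(x._2, y._2)))  // look up the function per key
  .mapValues(_._2)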

Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread lordjoe
, Tuple2<K, V>(t._1(), new Tuple2<K, V>(t._1(), t._2())); } }); } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177p19198.html Sent from the Apache Spark User List

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Gen
I am missing out; I tried changing the serializer but am still getting a similar error. I placed the error here: http://pastebin.com/0tqiiJQm -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-code-crashing-on-ReduceByKey-if-I-return-custom-class-object

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread Davies Liu
) File /home/sid/Downloads/spark/python/pyspark/serializers.py, line 464, in read_int raise EOFError EOFError -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-code-crashing-on-ReduceByKey-if-I-return-custom-class-object-tp17317.html Sent from

Re: Python code crashing on ReduceByKey if I return custom class object

2014-10-27 Thread sid
I have implemented the advice and now the issue is resolved. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-code-crashing-on-ReduceByKey-if-I-return-custom-class-object-tp17317p17426.html Sent from the Apache Spark User List mailing list archive

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-15 Thread Sean Owen
it could. Is ReduceWords actually an inner class? Or, on another tangent, when you remove reduceByKey, you are also removing print? That would cause it to do nothing, which of course generates no error. On Wed, Oct 15, 2014 at 12:11 AM, Abraham Jacob abe.jac...@gmail.com wrote: Hi All, I am

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-15 Thread Abraham Jacob
the reduceByKey) and reading the log file led me to believe that there was something wrong with the way I implemented the reduceByKey. In fact there was nothing wrong with the reduceByKey implementation. Just for closure (no pun intended), I will try and explain what happened. Maybe it will help someone else

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-15 Thread Ray
there for almost 1 hour. I guess I can only go with random initialization in KMeans. Thanks again for your help. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16530.html Sent from the Apache Spark User List

Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
observable hanging. Hopefully this provides more information. Thanks. Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) the parameter number of runs is set very high, 2) k is set high (you have observed this already), 3) the data is not properly repartitioned. It seems that it is hanging, but there is a lot
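
Those mitigations in code form, as a sketch with illustrative parameters: repartition the input, keep the number of runs low, and (as a later reply in the thread settles on) fall back to random initialization instead of k-means||.

import org.apache.spark.mllib.clustering.KMeans

// vectors: RDD[org.apache.spark.mllib.linalg.Vector] -- hypothetical input
val data = vectors.repartition(200).cache()

val model = new KMeans()
  .setK(1000)
  .setMaxIterations(20)
  .setRuns(1)                              // a high number of runs multiplies the work
  .setInitializationMode(KMeans.RANDOM)    // cheaper than k-means|| initialization
  .run(data)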

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
be an active stage with an incomplete progress bar in the UI. Am I wrong? Thanks, Burak! Ray -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16438.html Sent from the Apache Spark User List mailing

Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
... The moment I introduce the reduceByKey, I start getting the following error and Spark Streaming shuts down - 14/10/14 17:58:45 ERROR JobScheduler: Error running job streaming job 1413323925000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread DB Tsai
I saw a similar bottleneck in the reduceByKey operation. Maybe we can implement treeReduceByKey to reduce the pressure on the single executor reducing that particular key. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
the input data? If yes, could you check the storage tab of Spark WebUI and see how the data is distributed across executors. -Xiangrui On Tue, Oct 14, 2014 at 4:26 PM, DB Tsai dbt...@dbtsai.com wrote: I saw similar bottleneck in reduceByKey operation. Maybe we can implement treeReduceByKey

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
the reduceBykey, I start getting the following error and spark streaming shuts down - 14/10/14 17:58:45 ERROR JobScheduler: Error running job streaming job 1413323925000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-14 Thread Abraham Jacob
ReduceWords()); wordCount.print(); jssc.start(); jssc.awaitTermination(); return 0; If I remove the (highlighted) code JavaPairDStream<String, Integer> wordCount = wordMap.reduceByKey(new ReduceWords());, the application works just fine... The moment I introduce the reduceByKey, I start getting

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
in 11 iterations. 14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913 seconds.* 14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at KMeans.scala:190 14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at KMeans.scala:190) 14/10/14 15:54:37 INFO

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Xiangrui Meng
in 11 iterations. 14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913 seconds.* 14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at KMeans.scala:190 14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at KMeans.scala:190) 14/10/14 15:54

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-22 Thread kriskalish
to it after I upgrade to Spark 1.1.0. -Kris -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-aggregation-results-when-reusing-objects-inside-reduceByKey-tp14287p14835.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread kriskalish
aggregation: class Record(val PrimaryId: Int, val SubId: Int, var Event1Count: Int, var Event2Count: Int) extends Serializable { } Then once I have an RDD I do a reduce by key: val allAgr = all.map(x => (s"${x.PrimaryId}-${x.SubId}", x)).reduceByKey { (l, r

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread Sean Owen
Serializable { } Then once I have an RDD I do a reduce by key: val allAgr = all.map(x => (s"${x.PrimaryId}-${x.SubId}", x)).reduceByKey { (l, r) => l.Event1Count = l.Event1Count + r.Event1Count; l.Event2Count = l.Event2Count + r.Event2Count; l }.map(x => x._2) The problem
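
The safer pattern for the code above, sketched with a case class: return a fresh object from the reduce function instead of mutating one of its arguments, since Spark may reuse the objects handed to the function (particularly when records come out of the shuffle).

case class Record(primaryId: Int, subId: Int, event1Count: Int, event2Count: Int)

// all: RDD[Record] -- hypothetical, mirrors the thread's data
val allAgg = all
  .map(x => (s"${x.primaryId}-${x.subId}", x))
  .reduceByKey((l, r) => l.copy(
    event1Count = l.event1Count + r.event1Count,
    event2Count = l.event2Count + r.event2Count))
  .map(_._2)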

ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
Hello, I am facing performance issues with reduceByKey. I know that this topic has already been covered but I did not really find answers to my question. I am using reduceByKey to remove entries with identical keys, using, as the reduce function, (a, b) => a. It seems to be a relatively

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
If you are just looking for distinct keys, .keys.distinct() should be much better. On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme julien.ca...@gmail.com wrote: Hello, I am facing performance issues with reduceByKey. In know that this topic has already been covered but I did not really find

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
, .keys.distinct() should be much better. On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme julien.ca...@gmail.com wrote: Hello, I am facing performance issues with reduceByKey. In know that this topic has already been covered but I did not really find answers to my question. I am

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
] x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l } Does that make sense? On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme julien.ca...@gmail.com wrote: I need to remove objects with duplicate key, but I need the whole object. Objects which have the same key are not necessarily

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
too. On Sat, Sep 13, 2014 at 1:15 PM, Gary Malouf malouf.g...@gmail.com wrote: You need something like: val x: RDD[MyAwesomeObject] x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l } Does that make sense? On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme julien.ca

Jobs get stuck at reduceByKey stage with spark 1.0.1

2014-08-12 Thread Shivani Rao
Hello spark aficionados, We upgraded from spark 1.0.0 to 1.0.1 when the new release came out and started noticing some weird errors. Even a simple operation like reduceByKey or count on an RDD gets stuck in cluster mode. This issue does not occur with spark 1.0.0 (in cluster or local mode

reduceByKey to get all associated values

2014-08-07 Thread Konstantin Kudryavtsev
Hi there, I'm interested in whether it is possible to get the same behavior as the reduce function from the MR framework. I mean, for each key K, get the list of associated values List<V>. The reduceByKey function works only with separate Vs from the list. Is there any way to get the list? Because I have

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
values ListV. There is function reduceByKey that works only with separate V from list. Is it exist any way to get list? Because I have to sort it in particular way and apply some business logic. Thank you in advance, Konstantin Kudryavtsev

Re: reduceByKey to get all associated values

2014-08-07 Thread chutium
a long time ago, at Spark Summit 2013, Patrick Wendell said in his talk about performance (http://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/) that reduceByKey will be more efficient than groupByKey... He mentioned groupByKey copies all data over the network

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
The point is that in many cases the operation passed to reduceByKey aggregates data into a much smaller size, say + and * for integers. String concatenation doesn’t actually “shrink” data, thus in your case rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer a similar performance issue. In general

Re: reduceByKey to get all associated values

2014-08-07 Thread Evan R. Sparks
Specifically, reduceByKey expects a commutative/associative reduce operation, and will automatically do this locally before a shuffle, which means it acts like a combiner in MapReduce terms - http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions On Thu
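
For the original question (all values for a key, then sort and apply custom logic), a groupByKey-plus-mapValues sketch is the straightforward route; an aggregateByKey variant is shown for comparison. Here pairs is a hypothetical RDD[(String, String)].

val grouped = pairs.groupByKey().mapValues(vs => vs.toList.sorted)   // then apply the business logic

val viaAggregate = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,   // fold a value into the partition-local list
  (a, b)   => a ::: b)    // merge lists from different partitions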

graph reduceByKey

2014-08-05 Thread Omer Holzinger
-- v4 v2 -- v6 v4 -- v2 v4 -- v6 v6 -- v2 v6 -- v4 Does anyone have advice on what will be the best way to do that over a graph instance? I attempted to do it using mapReduceTriplets but I need the reduce function to work like reduceByKey, which I wasn't able to do. Thank you. -- Omer
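
One sketch of a reduceByKey-style per-vertex aggregation in GraphX, here collecting each vertex's incoming neighbours; graph is a hypothetical Graph[Int, Int], and mapReduceTriplets is the operator mentioned in the question.

import org.apache.spark.graphx.{Graph, VertexRDD}

// graph: Graph[Int, Int] -- hypothetical
val neighbours: VertexRDD[Set[Long]] = graph.mapReduceTriplets[Set[Long]](
  triplet => Iterator((triplet.dstId, Set(triplet.srcId))),  // map: one message per edge
  (a, b) => a ++ b)                                          // reduce: union the sets per vertex

// the same thing on the edge RDD directly, with plain reduceByKey
val viaEdges = graph.edges.map(e => (e.dstId, Set(e.srcId))).reduceByKey(_ ++ _)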

Spark ReduceByKey - Working in Java

2014-08-02 Thread Anil Karamchandani
Hi, I am a complete newbie to Spark and MapReduce frameworks and have a basic question on the reduce function. I was working on the word count example and was stuck at the reduce stage where the sum happens. I am trying to understand the working of reduceByKey in Spark using Java
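
A Scala rendering of the step being asked about, with the sentence from the question as toy data: reduceByKey folds the per-word 1s together pairwise (for "I": 1 + 1 = 2), partly on the map side and again after the shuffle. Assumes a SparkContext sc.

val words  = sc.parallelize(Seq("I", "am", "who", "I", "am"))
val ones   = words.map(word => (word, 1))
val counts = ones.reduceByKey((a, b) => a + b)
counts.collect().foreach(println)   // (I,2), (am,2), (who,1)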

Re: Spark ReduceByKey - Working in Java

2014-08-02 Thread Sean Owen
to understand the working of the reducebykey in Spark using java as the programming language. Say if I have a sentence I am who I am I break the sentence into words and store as list [I, am, who, I, am] now this function assigns 1 to each word JavaPairRDD ones = words.mapToPair(new PairFunction

Re: reduceByKey Not Being Called by Spark Streaming

2014-07-03 Thread Dan H.
Hi All, I was able to resolve this matter with a simple fix. It seems that in order to process a reduceByKey and the flatMap operations at the same time, the only way to resolve it was to increase the number of threads to more than 1. Since I'm developing on my personal machine for speed, I simply updated
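
The same point in code form, with illustrative settings: a streaming job run locally needs at least two threads, one for the receiver and one for processing, otherwise stages such as the reduceByKey never get scheduled.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-reduce").setMaster("local[2]")  // not local[1]
val ssc  = new StreamingContext(conf, Seconds(5))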

reduceByKey Not Being Called by Spark Streaming

2014-07-02 Thread Dan H.
Hi all, I recently picked up Spark and am trying to work through a coding issue that involves the reduceByKey method. After various debugging efforts, it seems that the reduceByKey method never gets called. Here's my workflow, which is followed by my code and results: My parsed data

Re: Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
I guess this is a basic question about the usage of reduce. Please shed some light, thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7834p7836.html Sent from the Apache Spark User List mailing

Cannot print a derived DStream after reduceByKey

2014-06-18 Thread haopu
this is a basic question about the reduce function. I would appreciate any help, thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7837.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Gaurav Jain
I have a simple Java class, shown below, that I want to use as a key while applying the groupByKey or reduceByKey functions: private static class FlowId { public String dcxId; public String trxId; public String msgType
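
In Scala the equivalent key can be a case class, which supplies value-based equals, hashCode, and Serializable for free, which is exactly what groupByKey/reduceByKey need from a key; in Java the fix is to override equals() and hashCode() on FlowId. A sketch:

case class FlowId(dcxId: String, trxId: String, msgType: String)

// records: RDD[SomeRecord] -- hypothetical
// records.map(r => (FlowId(r.dcxId, r.trxId, r.msgType), 1)).reduceByKey(_ + _)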

Re: Using custom class as a key for groupByKey() or reduceByKey()

2014-06-15 Thread Sean Owen
as a key while applying groupByKey or reduceByKey functions: private static class FlowId { public String dcxId; public String trxId; public String msgType; public FlowId(String dcxId, String trxId, String msgType
