get something like:
v2 --> v4
v2 --> v6
v4 --> v2
v4 --> v6
v6 --> v2
v6 --> v4
Does anyone have advice on the best way to do this over a graph instance?
I attempted to do it using mapReduceTriplets but I need the reduce function
to work like reduceByKey, which I wasn't able to do.
Thank you.
-- Omer
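A hedged sketch of one way to get those directed pairs, assuming a GraphX Graph named
"graph" (this sidesteps mapReduceTriplets and runs a plain reduceByKey over the edge RDD):

  import org.apache.spark.SparkContext._
  import org.apache.spark.graphx._

  // Emit every edge in both directions, e.g. (2,4) and (4,2).
  val bothDirections = graph.edges.flatMap(e => Seq((e.srcId, e.dstId), (e.dstId, e.srcId)))
  bothDirections.collect().foreach { case (src, dst) => println("v" + src + " --> v" + dst) }

  // If a reduceByKey-style aggregation per vertex is the goal, collapse to neighbour sets:
  val neighbours = bothDirections.map { case (src, dst) => (src, Set(dst)) }.reduceByKey(_ ++ _)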
…where the sum happens.
>
> I am trying to understand the working of reduceByKey in Spark using Java
> as the programming language.
>
> Say if I have a sentence "I am who I am", I break the sentence into words and
> store them as a list [I, am, who, I, am].
>
> Now this function assi…
Hi,
I am a complete newbie to Spark and map-reduce frameworks and have a basic
question on the reduce function. I was working on the word count example
and was stuck at the reduce stage where the sum happens.
I am trying to understand the working of reduceByKey in Spark using
Java as the…
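For reference, a minimal sketch of where that sum happens (shown in Scala; the Java API
is analogous, with mapToPair and a Function2 for the reduce):

  val words = sc.parallelize(Seq("I", "am", "who", "I", "am"))
  val counts = words
    .map(word => (word, 1))           // every occurrence becomes a (word, 1) pair
    .reduceByKey((a, b) => a + b)     // counts for the same word are summed pairwise
  counts.collect().foreach(println)   // (I,2), (am,2), (who,1) in some order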
Hi All,
I was able to resolve this matter with a simple fix. It seems that in order
to process a reduceByKey and the flatMap operations at the same time, the
only way to resolve it was to increase the number of threads to more than 1.
Since I'm developing on my personal machine for speed,…
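A guess at what that fix looks like in code (the message does not show it, so the names
below are assumptions):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("reduceByKey-local")
    .setMaster("local[2]")   // more than one local thread, per the fix described above
  val sc = new SparkContext(conf)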
Hi all, I recently picked up Spark and am trying to work through a
coding issue that involves the reduceByKey method. After various debugging
efforts, it seems that the reduceByKey method never gets called.
Here's my workflow, which is followed by my code and results:
My parsed…
…reduce function.
I would appreciate any help, thank you!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7837.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I guess this is a basic question about the usage of reduce. Please shed some
light on it, thank you!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-print-a-derived-DStream-after-reduceByKey-tp7834p7836.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> I have a simple Java class as follows, that I want to use as a key while
>> applying groupByKey or reduceByKey functions:
>>
>> private static class FlowId {
>> public String dcxId;
>> public String trxId;
>>
> …that I want to use as a key while
> applying groupByKey or reduceByKey functions:
>
> private static class FlowId {
> public String dcxId;
> public String trxId;
> public String msgType;
>
> public FlowId(String dcx
I have a simple Java class as follows, that I want to use as a key while
applying groupByKey or reduceByKey functions:
private static class FlowId {
public String dcxId;
public String trxId;
public String msgType
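A general note, not stated in the excerpt: a class used as a key for groupByKey or
reduceByKey must provide consistent equals() and hashCode() (and be serializable), or
identical keys coming from different partitions will never be merged. A Scala sketch of
the same idea (a case class gets all three for free); in Java the equivalent fix is
overriding equals() and hashCode() on FlowId:

  case class FlowKey(dcxId: String, trxId: String, msgType: String)

  val flows = sc.parallelize(Seq(
    (FlowKey("d1", "t1", "REQ"), 1),
    (FlowKey("d1", "t1", "REQ"), 1)))
  flows.reduceByKey(_ + _).collect()   // the two equal keys merge: (FlowKey(d1,t1,REQ), 2)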
> …(mergeCombiners, mergeValue, etc.) and return that instead of
> allocating a new object. So it should work with mutable objects — please
> post what problems you had with that. reduceByKey actually also allows this
> if your types are the same.
>
> Matei
>
>
> On Jun 11, 2014, at 3:21 PM, Diana Hu wrote:
> (10.20.11.3,Set(10.10.61.95))
> ...
>
>
>
> What I want is a set of (sourceIP -> Set(all the destination IPs)). I am using
> a set because, as you can see above, the same source may have the same
> destination multiple times and I want to eliminate dupes on the destination side.
…a set of (sourceIP -> Set(all the destination IPs)). I am using a
set because, as you can see above, the same source may have the same
destination multiple times and I want to eliminate dupes on the destination
side.
When I call the reduceByKey() method, I never get any output. When I do a
"srcDestinat…
…allocating a new
object. So it should work with mutable objects — please post what problems you
had with that. reduceByKey actually also allows this if your types are the same.
Matei
On Jun 11, 2014, at 3:21 PM, Diana Hu wrote:
> Hello all,
>
> I've seen some performance improvements…
Hello all,
I've seen some performance improvements using combineByKey as opposed to
reduceByKey or a groupByKey+map function. I have a couple of questions; it'd
be great if anyone could shed some light on this.
1) When should I use combineByKey vs reduceByKey?
2) Do the containers…
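On question 1, the usual rule of thumb (an editorial note, not from the thread): use
reduceByKey when the merged value has the same type as the input value, and combineByKey
when it does not. A per-key average is the classic case where the types differ:

  val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

  // reduceByKey: (V, V) => V, so the values stay Doubles.
  val sums = scores.reduceByKey(_ + _)

  // combineByKey: values become (sum, count) pairs so an average can be computed.
  val avgs = scores.combineByKey(
      (v: Double) => (v, 1),                                               // createCombiner
      (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),         // mergeValue
      (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))  // mergeCombiners
    .mapValues { case (sum, count) => sum / count }

Internally reduceByKey is built on combineByKey, so for same-type reductions there is
little performance difference between the two.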
Thanks. It worked.
2014-05-30 1:53 GMT+08:00 Matei Zaharia :
> That hash map is just a list of where each task ran, it’s not the actual
> data. How many map and reduce tasks do you have? Maybe you need to give the
> driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks).
That hash map is just a list of where each task ran, it’s not the actual data.
How many map and reduce tasks do you have? Maybe you need to give the driver a
bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use
only 100 tasks).
Matei
On May 29, 2014, at 2:03 AM, haitao
Hi,
I used 1g of memory for the driver Java process and got an OOM error on the
driver side before reduceByKey. After analyzing the heap dump, the biggest
object is org.apache.spark.MapStatus, which occupied over 900MB of memory.
Here's my question:
1. Are there any optimization switches that I can…
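A hedged illustration of the two remedies suggested in the reply above (the data is
hypothetical; the driver heap itself is raised at launch time, e.g. spark-submit
--driver-memory 4g, not inside the job):

  val pairs = sc.parallelize(1 to 10000000).map(i => (i % 1000, 1))
  // The second argument caps the number of reduce tasks, so the driver tracks far fewer
  // map/reduce task combinations (and therefore fewer MapStatus entries).
  val counts = pairs.reduceByKey(_ + _, 100)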
> > …roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> >
> >
> > -
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
I have had a lot of success with Spark on large datasets,
both in terms of performance and flexibility.
However, I hit a wall with reduceByKey when the RDD contains billions of
items.
I am reducing with simple functions like addition for building histograms,
so the reduction process should be…
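A sketch of the kind of histogram reduce described above (the input path and key layout
are assumptions): with billions of records, giving reduceByKey an explicit, generous
partition count keeps each reduce task's hash map small enough to fit in memory.

  val events = sc.textFile("hdfs:///data/events")     // assumed location
  val histogram = events
    .map(line => (line.split("\t")(0), 1L))           // assume the first field is the bucket
    .reduceByKey(_ + _, 2048)                         // many reduce partitions for a huge key space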
…SparkContext object solved the problem.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-net-SocketException-on-reduceByKey-in-pyspark-tp2184p4612.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
…difference at all.
> /usr/local/pkg/spark is also on all the nodes... it has to be in order to
> get all the nodes up and running in the cluster. Also, I'm confused by what
> you mean with "That's most probably the class that implements the closure
> you're passi
>>>> …using println to debug is great for me to explore Spark.
>>>> Thank you very much for your kind help.
>>>>
>>>> On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos <
>>>> daniel.dara...@lynxanalytics.com> wrote:
Here's a way to debug something like this:
scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => {
  println("v1: " + v1)
  println("v2: " + v2)
  (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
}).collect
You get:
v1…
…org.apache.spark.rdd.RDD[String] = MappedRDD[91] at textFile at <console>:12
scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => (v1.split("
")(1).toInt + v2.split(" ")(1).toInt).toString).first
then error occurs:
14/04/18 00:20:11 ERROR Executor: Exception in task ID 36
java.lang.ArrayIndexOutOfBoundsException…
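A guess at the usual cause of that exception (not confirmed by the excerpt):
.split(" ")(1) throws ArrayIndexOutOfBoundsException for any line with fewer than two
fields. A guarded variant of the same reduce:

  val guarded = d5
    .map(_.split(" "))
    .filter(_.length >= 2)                        // skip lines without a second field
    .map(parts => (parts(0), parts(1).toInt))
    .reduceByKey(_ + _)                           // sum the second field per key
  guarded.first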
…the class that implements the closure you're passing
as an argument to the reduceByKey() method". The only thing I'm passing to
it is "_ + _"... and as you mentioned, it's pretty much the same as the map()
method.
If I run the following code, it runs 100% properly on the cluster:
val nu
…That's most probably the
class that implements the closure you're passing as an argument to the
reduceByKey() method. Although I can't really explain why the same
isn't happening for the closure you're passing to map()...
Sorry I can't be more helpful.
> I still get the err
…has trouble finding the code that is itself. And why is it occurring
only with the reduceByKey function? I have no problems with
any other code running except for that. (BTW, I don't use <...> in
my code above... I just removed it for security purposes.)
Thanks,
Ian
On Mon, Apr 14, 2014 at 12:45 PM
…Apr 14, 2014 at 9:17 AM, Ian Bonnycastle wrote:
> Good afternoon,
>
> I'm attempting to get the wordcount example working, and I keep getting an
> error in the "reduceByKey(_ + _)" call. I've scoured the mailing lists, and
> haven't been able to find a sure fire
Good afternoon,
I'm attempting to get the wordcount example working, and I keep getting an
error in the "reduceByKey(_ + _)" call. I've scoured the mailing lists and
haven't been able to find a sure-fire solution, unless I'm missing
something big. I did find somethi…
…iteration is the same. However, I found
that during the first iteration the reduceByKey() (line 162) has 6 tasks,
during the second iteration it has 18 tasks, the third iteration 54 tasks,
and the fourth iteration 162 tasks.
During the sixth iteration it has 1458 tasks, which almost costs more…
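A hedged guess at the usual remedy for this pattern (the excerpt is truncated, so this
may not match the poster's code): when each iteration's RDD depends on the previous one,
the lineage, and with it the task count, grows every pass; persisting plus periodically
checkpointing the iterate keeps it flat.

  sc.setCheckpointDir("hdfs:///tmp/checkpoints")                   // assumed path
  var state = sc.parallelize(1 to 1000).map(i => (i % 10, 1L))     // stand-in for the real iterate
  for (i <- 1 to 10) {
    state = state.reduceByKey(_ + _).persist()                     // stand-in per-iteration update
    if (i % 3 == 0) state.checkpoint()                             // cut the lineage every few passes
    state.count()                                                  // materialize before the next pass
  }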
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui
On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming wrote:
> Thanks again.
>
>> If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master,
>
> The "master" here is the app/driver/spark-sh
Thanks again.
> If you use the KMeans implementation from MLlib, the
> initialization stage is done on master,
The “master” here is the app/driver/spark-shell?
Thanks!
On 25 Mar, 2014, at 1:03 am, Xiangrui Meng wrote:
> Number of rows doesn't matter much as long as you have enough workers
>
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on master…
Thanks. Let me try with a smaller K.
Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng wrote:
> K = 50 is certainly a large number for k-means.
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans algorithm in mllib clustering for
your problem.
-Xiangrui
Hi,
This is on a 4-node cluster, each node with 32 cores/256GB RAM.
Spark (0.9.0) is deployed in standalone mode.
Each worker is configured with 192GB. Spark executor memory is also 192GB.
This is on the first iteration. K=50. Here's the code I use:
http://pastebin.com/2yXL3y8i , which is a copy-…
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? It can help solve your
issues. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote:
> Hi,
>
> At the reduceByKey stage, it takes a few minutes before…
Hi,
At the reduceByKey stage, it takes a few minutes before the tasks start
working.
I have -Dspark.default.parallelism=127 (total cores minus 1).
CPU/Network/IO is idling across all nodes when this is happening,
and there is nothing in particular in the master log file. From the spark-shell:
14/03/23 1…
www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Thu, Mar 20, 2014 at 3:20 PM, Ameet Kini wrote:
>
>>
>> val rdd2 = rdd.partitionBy(my partitioner).reduceByKey(some function)
>>
>> I see that rdd2's partitions are not
> …rdd.partitionBy(my partitioner).reduceByKey(some function)
>
> I see that rdd2's partitions are not internally sorted. Can someone
> confirm that this is expected behavior? And if so, the only way to get
> partitions internally sorted is to follow it with something like this
>
> val rdd2 = rdd.
val rdd2 = rdd.partitionBy(my partitioner).reduceByKey(some function)
I see that rdd2's partitions are not internally sorted. Can someone confirm
that this is expected behavior? And if so, the only way to get partitions
internally sorted is to follow it with something like this
val
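A hedged sketch of one way to get each partition internally sorted afterwards (the
excerpt cuts off here, so this is an assumption about the follow-up; the names below
are stand-ins for the rdd, partitioner and function in the question):

  import org.apache.spark.HashPartitioner

  val rdd = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))
  val rdd2 = rdd.partitionBy(new HashPartitioner(4)).reduceByKey(_ + _)

  // reduceByKey gives no ordering guarantee inside a partition, so sort each one explicitly.
  val sortedWithin = rdd2.mapPartitions(
    iter => iter.toSeq.sortBy(_._1).iterator,   // sort this partition's pairs by key
    preservesPartitioning = true)               // the partitioner is left intact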
I have the exact same error running on a bare metal cluster with CentOS6
and Python 2.6.6. Any other thoughts on the problem here? I only get the
error on operations that require communication, like reduceByKey or groupBy.
On Sun, Mar 2, 2014 at 1:29 PM, Nicholas Chammas wrote:
> Alright,
    List<List<String>> l = new ArrayList<List<String>>();
    for (Tuple2<Integer, List<String>> value : values) {
      count += value._1;
      l.add(value._2);
    }
    return new Tuple2<Integer, List<List<String>>>(count, l);
  }
});
Or "combineByKey" which often has better performance.
Best Regards,
Shixiong Zhu
2014-03-14 0:56 GMT+08:00 goi cto :
> Hi,
>
> I have
Hi,
I have an RDD with <Key, Tuple2<Integer, List<String>>> values which I want to reduceByKey
to get <Integer + Integer, List of Lists>
(add the integers and build a list of the lists).
But reduceByKey requires that the return value be of the same type as the
input, so I cannot combine the lists.
JavaPairRDD<...> callCount =
byCaller.reduc…
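A Scala sketch of the combineByKey route suggested in the reply above (the thread's code
is Java, but the shape is the same): the combiner type (Int, List[List[String]]) differs
from the value type (Int, List[String]), which is exactly what reduceByKey cannot express.

  val byCaller = sc.parallelize(Seq(
    ("alice", (1, List("a", "b"))),
    ("alice", (2, List("c")))))                    // made-up (count, items) values

  val callCount = byCaller.combineByKey(
    (v: (Int, List[String])) => (v._1, List(v._2)),                        // createCombiner
    (acc: (Int, List[List[String]]), v: (Int, List[String])) =>
      (acc._1 + v._1, v._2 :: acc._2),                                     // mergeValue
    (a: (Int, List[List[String]]), b: (Int, List[List[String]])) =>
      (a._1 + b._1, a._2 ::: b._2))                                        // mergeCombiners
  callCount.collect().foreach(println)             // (alice,(3,List(List(c), List(a, b)))) or similar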
> …a job with X reducers, then Spark will open C*X files in parallel and
> start writing. Shuffle consolidation will help decrease the total
> number of files created but the number of file handles open at any
> time doesn't change so it won't help the ulimit problem.
>
> This mea
…doesn't change, so it won't help the ulimit problem.
This means you'll have to use fewer reducers (e.g. pass reduceByKey a
number of reducers) or use fewer cores on each machine.
- Patrick
On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
wrote:
> Hi everyone,
>
> My team (cc'
Hi everyone,
My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
operation on a cluster of 10 slaves where I don't have the privileges to
set "ulimit -n" to a higher number. I'm running on a cluster where "ulimit
-n" returns 1024 on each machi
>> …akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at
>> akka.dispatch.Mailbox.run(Mailbox.scala:219) at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java
> …scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >>>
>
>
> The lambda passed to flatMap() returns a list of tuples; take() works fine
> just on the flatMap().
>
> Where would I start to troubleshoot this error?
>
> The error output includes m
…I upgraded the cluster to Python 2.7 using the
instructions here <https://spark-project.atlassian.net/browse/SPARK-922>.
Also, I am running Spark 0.9.0, though I notice that the error output
mentions 0.8.1 files.
Nick
--
View this message in context:
http://apache-spark-user-l
> …correct approach; sortByKey
> may be what I am looking for. Any insight would be helpful... there are not
> many examples out there for newbies such as myself.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-or-groupByKey-to-Count-tp1765p2110.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>