Dear Spark developers,
I am trying to measure Spark's reduce performance for big vectors. My
motivation is machine learning gradient computation: the gradient is a vector
that is computed on each worker, and then all the results need to be summed
up and broadcast back to the workers. For example,
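A minimal sketch of the pattern just described (the vector size, partition
count, and use of Breeze vectors are my assumptions, not from the original
message; runnable in spark-shell with Breeze on the classpath):

    import breeze.linalg.DenseVector

    // Illustrative setup: one large gradient vector per partition.
    val dim = 1000000
    val parts = 8
    val vv = sc.parallelize(1 to parts, parts)
               .map(_ => DenseVector.fill(dim)(1.0))

    // Sum the per-worker gradients on the driver...
    val gradientSum = vv.reduce(_ + _)
    // ...then broadcast the summed gradient back to the workers.
    val gradientBC = sc.broadcast(gradientSum)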
Okay, thanks. The design document mostly details the infrastructure for
optimization strategies but doesn’t detail the strategies themselves. I take
it the set of strategies is basically embodied in SparkStrategies.scala... is
there a design doc/roadmap/JIRA issue detailing what strategies
Added PR https://github.com/apache/spark/pull/4139
https://github.com/apache/spark/pull/4139 - I think the tests have been
re-arranged, so a merge is necessary
Mick
On 19 Jan 2015, at 18:31, Reynold Xin r...@databricks.com wrote:
Definitely go for a pull request!
On Mon, Jan 19, 2015 at 10:10
Hi DB Tsai,
Thank you for your suggestion. Actually, I've started my experiments with
treeReduce. Originally, I had vv.treeReduce(_ + _, 2) in my script, exactly
because MLlib optimizers use it, as you pointed out with LBFGS. However,
it leads to the same problems as reduce, but
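A sketch of the two calls being compared (assuming Spark 1.2, where
treeReduce comes from MLlib's RDDFunctions, and reusing the `vv` RDD of
Breeze vectors from the sketch above):

    // treeReduce lives in org.apache.spark.mllib.rdd.RDDFunctions
    // in Spark 1.2 (it moved to the core RDD API in 1.3).
    import org.apache.spark.mllib.rdd.RDDFunctions._

    // Plain reduce: every partition's vector is shipped to the driver.
    val s1 = vv.reduce(_ + _)

    // treeReduce with depth 2: partition results are first combined on
    // the executors, so the driver receives far fewer large vectors.
    val s2 = vv.treeReduce(_ + _, 2)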
No, are you looking for something in particular?
On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy halcyo...@gmail.com
wrote:
Okay, thanks. The design document mostly details the infrastructure for
optimization strategies but doesn’t detail the strategies themselves. I
take it the set of
It would be much appreciated if somebody could help fix this issue -- or at
least give me some hints about what might be wrong
thanks,
Peter
2015-01-15 14:04 GMT+01:00 PierreB pierre.borckm...@realimpactanalytics.com:
Hi guys,
A few people seem to have the same problem with Spark 1.2.0, so I figured I
Hi Alexander,
For `reduce`, it's an action that collects all the data from the
mappers to the driver and performs the aggregation on the driver. As a
result, if the output from each mapper is very large and the number of
partitions is large, it can cause a problem.
For `treeReduce`, as the
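To make the contrast concrete, a sketch of treeAggregate, the related call
MLlib's optimizers use for gradient sums (reusing `vv` and `dim` from the
earlier sketch; Spark 1.2 MLlib):

    import breeze.linalg.DenseVector
    import org.apache.spark.mllib.rdd.RDDFunctions._

    // Partial sums are merged on the executors in a tree of the given
    // depth, so the driver sees only a handful of pre-merged vectors
    // instead of one per partition.
    val total = vv.treeAggregate(DenseVector.zeros[Double](dim))(
      (acc, v) => acc + v, // fold one record into a partial sum
      (a, b) => a + b,     // merge two partial sums
      2)                   // tree depth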
Hey Matei,
Thanks for your reply. We will keep in mind to use the JDBC server for
smaller queries.
For the MapReduce job start-up, are you pointing towards JVM initialization
latencies in MR? Other than JVM initialization, does Spark do any
optimization (that is not done by MapReduce) to speed up
Hi,
I want to find the storage locations (BlockManagerIds) of each partition
when the RDD is replicated twice. That is, if a twice-replicated RDD has 5
partitions, I would like to know the first and second storage location of
each partition. Basically, I am trying to modify the list of nodes
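One possible way to look this up, sketched under heavy assumptions:
BlockManagerMaster and friends are private[spark] internals, so this only
compiles inside an org.apache.spark package, and the RDD must already be
persisted (e.g. StorageLevel.MEMORY_ONLY_2) and materialized:

    import org.apache.spark.SparkEnv
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.{BlockManagerId, RDDBlockId}

    // For each partition of a persisted RDD, ask the driver's
    // BlockManagerMaster where the block's replicas currently live.
    def replicaLocations(rdd: RDD[_]): Seq[Seq[BlockManagerId]] = {
      val master = SparkEnv.get.blockManager.master
      (0 until rdd.partitions.length).map { i =>
        master.getLocations(RDDBlockId(rdd.id, i))
      }
    }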
It's hard to tell without more details, but the start-up latency in Hive can
sometimes be high, especially if you are running Hive on MapReduce. MR alone
takes 20-30 seconds per job to spin up, even if the job is doing nothing.
For real use of Spark SQL for short queries, by the way, I'd recommend
Did you use spark.files.userClassPathFirst = true? It's exactly for
this kind of problem.
On Fri, Jan 23, 2015 at 4:42 AM, William-Smith
williamsmith.m...@gmail.com wrote:
I have had the same issue while using HttpClient from AWS EMR Spark Streaming
to post to a Node.js server.
I have found
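For reference, one way to set the flag mentioned above (the app name is
illustrative; in Spark 1.x this experimental setting makes executors load
classes from the user's jars before Spark's own copies):

    import org.apache.spark.{SparkConf, SparkContext}

    // Prefer the application's jars when resolving classes, e.g. a
    // conflicting HttpClient version bundled with the cluster.
    val conf = new SparkConf()
      .setAppName("HttpClientStreamingJob") // hypothetical name
      .set("spark.files.userClassPathFirst", "true")
    val sc = new SparkContext(conf)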