Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Dear Spark developers, I am trying to measure the Spark reduce performance for big vectors. My motivation is related to machine learning gradient. Gradient is a vector that is computed on each worker and then all results need to be summed up and broadcasted back to workers. For example,
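The pattern described (each worker computes a gradient vector, the results are summed, and the sum is sent back) can be illustrated with a minimal pure-Python sketch. This is not Spark code; the gradient values and vector size are placeholders, and `reduce` here stands in for the role `RDD.reduce` plays across partitions.

```python
from functools import reduce

# Simulated per-worker gradients (hypothetical values); in Spark each
# vector would be computed on a separate worker/partition.
worker_gradients = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def add_vectors(a, b):
    """Elementwise sum of two gradient vectors of equal length."""
    return [x + y for x, y in zip(a, b)]

# Sum all worker results into a single gradient vector, the aggregation
# step that reduce performs before broadcasting the result back.
summed = reduce(add_vectors, worker_gradients)
print(summed)  # approximately [1.2, 1.5, 1.8], up to float rounding
```

With large vectors, the cost of this step is dominated by shipping every worker's full vector to one place, which is exactly the scaling concern raised in the thread.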

Re: query planner design doc?

2015-01-23 Thread Nicholas Murphy
Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of strategies is basically embodied in SparkStrategies.scala... Is there a design doc/roadmap/JIRA issue detailing what strategies

Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
Added PR https://github.com/apache/spark/pull/4139 - I think tests have been re-arranged so a merge is necessary. Mick On 19 Jan 2015, at 18:31, Reynold Xin r...@databricks.com wrote: Definitely go for a pull request! On Mon, Jan 19, 2015 at 10:10

RE: Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Hi DB Tsai, Thank you for your suggestion. Actually, I've started my experiments with treeReduce. Originally, I had vv.treeReduce(_ + _, 2) in my script exactly because MLlib optimizers are using it, as you pointed out with LBFGS. However, it leads to the same problems as reduce, but

Re: query planner design doc?

2015-01-23 Thread Michael Armbrust
No, are you looking for something in particular? On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy halcyo...@gmail.com wrote: Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of

Re: Spark 1.2.0: MissingRequirementError

2015-01-23 Thread Peter Prettenhofer
It would be much appreciated if somebody could help fix this issue -- or at least give me some hints about what might be wrong. Thanks, Peter 2015-01-15 14:04 GMT+01:00 PierreB pierre.borckm...@realimpactanalytics.com : Hi guys, A few people seem to have the same problem with Spark 1.2.0 so I figured I

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander, For `reduce`, it's an action that collects all the data from the mappers to the driver and performs the aggregation in the driver. As a result, if the output from each mapper is very large and the number of partitions is large, it might cause a problem. For `treeReduce`, as the
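The distinction drawn here can be sketched in plain Python, with no Spark involved. The function below combines per-partition results in pairwise rounds, which is the shape of aggregation `treeReduce` uses: each round halves the number of intermediate values, so the final step merges only a handful of results instead of one per partition. The partition values are placeholders.

```python
from functools import reduce

def tree_reduce(partition_results, combine):
    """Combine results in pairwise rounds. Each round halves the number
    of intermediate results; a flat reduce would instead merge all of
    them at a single point (the driver, in Spark's case)."""
    level = list(partition_results)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(combine(level[i], level[i + 1]))
        if len(level) % 2 == 1:       # odd element carries over to next round
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Hypothetical per-partition partial sums.
partials = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda a, b: a + b

# Both strategies yield the same total; only the combination shape differs.
assert tree_reduce(partials, add) == reduce(add, partials)
print(tree_reduce(partials, add))  # 36
```

In Spark the rounds correspond to intermediate aggregation stages (controlled by the `depth` argument of `treeReduce`), so less data arrives at the driver at once, at the cost of extra stages.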

Re: Spark performance gains for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hey Matei, Thanks for your reply. We will keep in mind using the JDBC server for smaller queries. For the mapreduce job start-up, are you referring to JVM initialization latencies in MR? Other than JVM initialization, does Spark do any optimization (that is not done by mapreduce) to speed up

Find the two storage locations of each partition of a replicated RDD.

2015-01-23 Thread Rapelly Kartheek
Hi, I want to find the storage locations (BlockManagerIds) of each partition when the RDD is replicated twice. I mean, if a twice-replicated RDD has 5 partitions, I would like to know the first and second storage locations of each partition. Basically, I am trying to modify the list of nodes

Re: Spark performance gains for small queries

2015-01-23 Thread Matei Zaharia
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing. For real use of Spark SQL for short queries by the way, I'd recommend

Re: spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2015-01-23 Thread Sean Owen
Did you use spark.files.userClassPathFirst = true? It's exactly for this kind of problem. On Fri, Jan 23, 2015 at 4:42 AM, William-Smith williamsmith.m...@gmail.com wrote: I have had the same issue while using HttpClient from AWS EMR Spark Streaming to post to a nodejs server. I have found
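For reference, the flag suggested above can be passed at submission time. This is a hedged sketch: the configuration key is the one named in the message, but the application jar and main class below are placeholders.

```shell
# Prefer classes from the user-supplied jar over Spark's own copies,
# per spark.files.userClassPathFirst (Spark 1.x). Jar and class names
# here are hypothetical.
spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --class com.example.MyStreamingApp \
  my-app-assembly.jar
```

This addresses dependency conflicts such as the AWS SDK / HttpClient version clash described in the thread, by resolving classes from the user's jar before Spark's bundled versions.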