Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Dear Spark developers, I am trying to measure the Spark reduce performance for big vectors. My motivation is related to machine learning gradient. Gradient is a vector that is computed on each worker and then all results need to be summed up and broadcasted back to workers. For example,
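The pattern described (each worker computes a gradient vector, the results are summed, and the sum is sent back) can be illustrated with a minimal pure-Python sketch. This is not Spark code; the gradient values and vector size are placeholders, and `reduce` here stands in for the role `RDD.reduce` plays across partitions.

```python
from functools import reduce

# Simulated per-worker gradients (hypothetical values); in Spark each
# vector would be computed on a separate worker/partition.
worker_gradients = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def add_vectors(a, b):
    """Elementwise sum of two gradient vectors of equal length."""
    return [x + y for x, y in zip(a, b)]

# Sum all worker results into a single gradient vector, the aggregation
# step that reduce performs before broadcasting the result back.
summed = reduce(add_vectors, worker_gradients)
print(summed)  # approximately [1.2, 1.5, 1.8], up to float rounding
```

With large vectors, the cost of this step is dominated by shipping every worker's full vector to one place, which is exactly the scaling concern raised in the thread.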

Re: query planner design doc?

2015-01-23 Thread Nicholas Murphy
Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of strategies is basically embodied in SparkStrategies.scala... Is there a design doc/roadmap/JIRA issue detailing what strategies

Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
Added PR https://github.com/apache/spark/pull/4139 - I think tests have been re-arranged so a merge is necessary. Mick On 19 Jan 2015, at 18:31, Reynold Xin r...@databricks.com wrote: Definitely go for a pull request! On Mon, Jan 19, 2015 at 10:10

RE: Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Hi DB Tsai, Thank you for your suggestion. Actually, I've started my experiments with treeReduce. Originally, I had vv.treeReduce(_ + _, 2) in my script exactly because MLlib optimizers are using it, as you pointed out with LBFGS. However, it leads to the same problems as reduce, but

Re: query planner design doc?

2015-01-23 Thread Michael Armbrust
No, are you looking for something in particular? On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy halcyo...@gmail.com wrote: Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of

Re: Spark 1.2.0: MissingRequirementError

2015-01-23 Thread Peter Prettenhofer
It would be much appreciated if somebody could help fix this issue -- or at least give me some hints about what might be wrong. Thanks, Peter 2015-01-15 14:04 GMT+01:00 PierreB pierre.borckm...@realimpactanalytics.com : Hi guys, A few people seem to have the same problem with Spark 1.2.0 so I figured I

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander, For `reduce`, it's an action that collects all the data from the mappers to the driver and performs the aggregation in the driver. As a result, if the output from each mapper is very large and the number of partitions is large, it might cause a problem. For `treeReduce`, as the
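The distinction drawn here can be sketched in plain Python, with no Spark involved. The function below combines per-partition results in pairwise rounds, which is the shape of aggregation `treeReduce` uses: each round halves the number of intermediate values, so the final step merges only a handful of results instead of one per partition. The partition values are placeholders.

```python
from functools import reduce

def tree_reduce(partition_results, combine):
    """Combine results in pairwise rounds. Each round halves the number
    of intermediate results; a flat reduce would instead merge all of
    them at a single point (the driver, in Spark's case)."""
    level = list(partition_results)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(combine(level[i], level[i + 1]))
        if len(level) % 2 == 1:       # odd element carries over to next round
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Hypothetical per-partition partial sums.
partials = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda a, b: a + b

# Both strategies yield the same total; only the combination shape differs.
assert tree_reduce(partials, add) == reduce(add, partials)
print(tree_reduce(partials, add))  # 36
```

In Spark the rounds correspond to intermediate aggregation stages (controlled by the `depth` argument of `treeReduce`), so less data arrives at the driver at once, at the cost of extra stages.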

Re: Spark performance gains for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hey Matei, Thanks for your reply. We will keep in mind using the JDBC server for smaller queries. For the mapreduce job start-up, are you referring to JVM initialization latencies in MR? Other than JVM initialization, does Spark do any optimization (that is not done by mapreduce) to speed up

Find the two storage locations of each partition of a replicated RDD.

2015-01-23 Thread Rapelly Kartheek
Hi, I want to find the storage locations (BlockManagerIds) of each partition when the RDD is replicated twice. I mean, if a twice-replicated RDD has 5 partitions, I would like to know the first and second storage locations of each partition. Basically, I am trying to modify the list of nodes

Re: Spark performance gains for small queries

2015-01-23 Thread Matei Zaharia
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing. For real use of Spark SQL for short queries by the way, I'd recommend

Re: spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2015-01-23 Thread Sean Owen
Did you use spark.files.userClassPathFirst = true? It's exactly for this kind of problem. On Fri, Jan 23, 2015 at 4:42 AM, William-Smith williamsmith.m...@gmail.com wrote: I have had the same issue while using HttpClient from AWS EMR Spark Streaming to post to a nodejs server. I have found
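For reference, the flag suggested above can be passed at submission time. This is a hedged sketch: the configuration key is the one named in the message, but the application jar and main class below are placeholders.

```shell
# Prefer classes from the user-supplied jar over Spark's own copies,
# per spark.files.userClassPathFirst (Spark 1.x). Jar and class names
# here are hypothetical.
spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --class com.example.MyStreamingApp \
  my-app-assembly.jar
```

This addresses dependency conflicts such as the AWS SDK / HttpClient version clash described in the thread, by resolving classes from the user's jar before Spark's bundled versions.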