Re: Spark performance gains for small queries

2015-01-23 Thread Matei Zaharia
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing. For real use of Spark SQL for short queries by the way, I'd recommend

Re: spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2015-01-23 Thread Sean Owen
Did you use spark.files.userClassPathFirst = true? It's exactly for this kind of problem. On Fri, Jan 23, 2015 at 4:42 AM, William-Smith wrote: > I have had the same issue while using HttpClient from AWS EMR Spark Streaming > to post to a nodejs server. > > I have found ... using > ClassLoader.get
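A minimal sketch of how that flag might be passed at submit time. The class and jar names below are hypothetical placeholders; the flag shown is the Spark 1.x experimental setting being discussed (later releases renamed it, e.g. spark.executor.userClassPathFirst):

```shell
# Hypothetical spark-submit invocation: ask Spark to prefer the
# user-supplied jar's classes (e.g. a newer AWS SDK / HttpClient)
# over the versions bundled with Spark when resolving conflicts.
spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --class com.example.MyStreamingJob \
  my-job-assembly.jar
```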

Re: Spark performance gains for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hey Matei, Thanks for your reply. We will keep in mind to use the JDBC server for smaller queries. For the MapReduce job start-up, are you pointing towards JVM initialization latencies in MR? Other than JVM initialization, does Spark do any optimization (that is not done by MapReduce) to speed up th

Re: query planner design doc?

2015-01-23 Thread Nicholas Murphy
Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of strategies are basically embodied in SparkStrategies.scala...is there a design doc/roadmap/JIRA issue detailing what strategies exi

Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Dear Spark developers, I am trying to measure Spark reduce performance for big vectors. My motivation is related to machine learning gradients: a gradient is a vector that is computed on each worker, and then all results need to be summed up and broadcast back to the workers. For example, present
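The aggregation pattern described above can be sketched in plain Python, without Spark: each worker produces a gradient vector of the same length, and the results are merged element-wise (the function name and shapes here are illustrative, not from the original post):

```python
def sum_gradients(gradients):
    """Element-wise sum of per-worker gradient vectors.

    gradients: list of equal-length numeric vectors, one per worker.
    Returns a single vector of the same length.
    """
    result = [0.0] * len(gradients[0])
    for g in gradients:
        for i, v in enumerate(g):
            result[i] += v
    return result
```

In Spark this merge function would be the binary operator handed to `reduce` (or `treeReduce`), applied pairwise across partition results.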

Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
Added PR https://github.com/apache/spark/pull/4139 - I think tests have been re-arranged, so a merge is necessary. Mick > On 19 Jan 2015, at 18:31, Reynold Xin wrote: > > Definitely go for a pull request! > > > On Mon, Jan 19, 2015 at 10:10 AM, Mick Dav

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander, When you use `reduce` to aggregate the vectors, those will actually be pulled into the driver and merged there. Obviously, it's not scalable given you are doing deep neural networks, which have so many coefficients. Please try treeReduce instead, which is what we do in linear regre

RE: Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Hi DB Tsai, Thank you for your suggestion. Actually, I've started my experiments with "treeReduce". Originally, I had "vv.treeReduce(_ + _, 2)" in my script exactly because MLlib optimizers are using it, as you pointed out with LBFGS. However, it leads to the same problems as "reduce", but pres

Re: query planner design doc?

2015-01-23 Thread Michael Armbrust
No, are you looking for something in particular? On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy wrote: > Okay, thanks. The design document mostly details the infrastructure for > optimization strategies but doesn’t detail the strategies themselves. I > take it the set of strategies are basic

Re: Spark 1.2.0: MissingRequirementError

2015-01-23 Thread Peter Prettenhofer
much appreciated if somebody could help fixing this issue -- or at least give me some hints what might be wrong thanks, Peter 2015-01-15 14:04 GMT+01:00 PierreB : > Hi guys, > > A few people seem to have the same problem with Spark 1.2.0 so I figured I > would push it here. > > see: > > http://

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander, For `reduce`, it's an action that will collect all the data from the mappers to the driver and perform the aggregation in the driver. As a result, if the output from each mapper is very large and the number of mapper partitions is large, it might cause a problem. For `treeReduce`, as the n
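The difference can be illustrated with a pure-Python analogue of `treeReduce` (a sketch of the idea only, not Spark's implementation): instead of one node combining every partition result at once, results are merged pairwise in rounds, so each round halves the number of intermediate values:

```python
def tree_reduce(parts, combine):
    """Merge partition results pairwise in rounds (tree aggregation).

    parts: non-empty list of per-partition results.
    combine: associative binary function merging two results.
    """
    items = list(parts)
    while len(items) > 1:
        merged = []
        # Combine adjacent pairs; each round halves the item count,
        # so the merge depth is logarithmic in the number of partitions.
        for i in range(0, len(items) - 1, 2):
            merged.append(combine(items[i], items[i + 1]))
        if len(items) % 2 == 1:
            merged.append(items[-1])  # odd item carries to the next round
        items = merged
    return items[0]
```

With a flat `reduce`, the driver would receive and combine all partition outputs itself; the tree pattern spreads that merge work across rounds, which is why it scales better for very large vectors.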

Find the two storage locations of each partition of a replicated RDD

2015-01-23 Thread Rapelly Kartheek
Hi, I want to find the storage locations (BlockManagerIds) of each partition when the RDD is replicated twice. I mean, if a twice-replicated RDD has got 5 partitions, I would like to know the first and second storage locations of each partition. Basically, I am trying to modify the list of nodes sel