Re: Spark performance gains for small queries

2015-01-23 Thread Matei Zaharia
It's hard to tell without more details, but the start-up latency in Hive can sometimes be high, especially if you are running Hive on MapReduce. MR just takes 20-30 seconds per job to spin up even if the job is doing nothing. For real use of Spark SQL for short queries by the way, I'd recommend

Re: spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2015-01-23 Thread Sean Owen
Did you use spark.files.userClassPathFirst = true? It's exactly for this kind of problem. On Fri, Jan 23, 2015 at 4:42 AM, William-Smith wrote: > I have had the same issue while using HttpClient from AWS EMR Spark Streaming > to post to a nodejs server. > > I have found ... using > ClassLoader.get
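A minimal sketch of how that flag might be passed at submit time. The class and jar names below are hypothetical placeholders; the flag shown is the Spark 1.x experimental setting being discussed (later releases renamed it, e.g. spark.executor.userClassPathFirst):

```shell
# Hypothetical spark-submit invocation: ask Spark to prefer the
# user-supplied jar's classes (e.g. a newer AWS SDK / HttpClient)
# over the versions bundled with Spark when resolving conflicts.
spark-submit \
  --conf spark.files.userClassPathFirst=true \
  --class com.example.MyStreamingJob \
  my-job-assembly.jar
```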

Re: Spark performance gains for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
Hey Matei, Thanks for your reply. We will keep in mind to use the JDBC server for smaller queries. For the MapReduce job start-up, are you pointing towards JVM initialization latencies in MR? Other than JVM initialization, does Spark do any optimization (that is not done by MapReduce) to speed up th

Re: query planner design doc?

2015-01-23 Thread Nicholas Murphy
Okay, thanks. The design document mostly details the infrastructure for optimization strategies but doesn’t detail the strategies themselves. I take it the set of strategies are basically embodied in SparkStrategies.scala...is there a design doc/roadmap/JIRA issue detailing what strategies exi

Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Dear Spark developers, I am trying to measure Spark reduce performance for big vectors. My motivation is related to machine learning gradients: a gradient is a vector that is computed on each worker, and then all results need to be summed up and broadcast back to the workers. For example, present
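The aggregation pattern described above can be sketched in plain Python, without Spark: each worker produces a gradient vector of the same length, and the results are merged element-wise (the function name and shapes here are illustrative, not from the original post):

```python
def sum_gradients(gradients):
    """Element-wise sum of per-worker gradient vectors.

    gradients: list of equal-length numeric vectors, one per worker.
    Returns a single vector of the same length.
    """
    result = [0.0] * len(gradients[0])
    for g in gradients:
        for i, v in enumerate(g):
            result[i] += v
    return result
```

In Spark this merge function would be the binary operator handed to `reduce` (or `treeReduce`), applied pairwise across partition results.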

Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
Added PR https://github.com/apache/spark/pull/4139 - I think tests have been re-arranged, so a merge is necessary. Mick > On 19 Jan 2015, at 18:31, Reynold Xin wrote: > > Definitely go for a pull request! > > > On Mon, Jan 19, 2015 at 10:10 AM, Mick Dav

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander, When you use `reduce` to aggregate the vectors, those will actually be pulled into the driver and merged there. Obviously, it's not scalable given you are doing deep neural networks, which have so many coefficients. Please try treeReduce instead, which is what we do in linear regre

RE: Maximum size of vector that reduce can handle

2015-01-23 Thread Ulanov, Alexander
Hi DB Tsai, Thank you for your suggestion. Actually, I've started my experiments with "treeReduce". Originally, I had "vv.treeReduce(_ + _, 2)" in my script exactly because MLlib optimizers are using it, as you pointed out with LBFGS. However, it leads to the same problems as "reduce", but pres

Re: query planner design doc?

2015-01-23 Thread Michael Armbrust
No, are you looking for something in particular? On Fri, Jan 23, 2015 at 9:44 AM, Nicholas Murphy wrote: > Okay, thanks. The design document mostly details the infrastructure for > optimization strategies but doesn’t detail the strategies themselves. I > take it the set of strategies are basic

Re: Spark 1.2.0: MissingRequirementError

2015-01-23 Thread Peter Prettenhofer
much appreciated if somebody could help fixing this issue -- or at least give me some hints what might be wrong thanks, Peter 2015-01-15 14:04 GMT+01:00 PierreB : > Hi guys, > > A few people seem to have the same problem with Spark 1.2.0 so I figured I > would push it here. > > see: > > http://

Re: Maximum size of vector that reduce can handle

2015-01-23 Thread DB Tsai
Hi Alexander, For `reduce`, it's an action that will collect all the data from the mappers to the driver and perform the aggregation in the driver. As a result, if the output from each mapper is very large and the number of mapper partitions is large, it might cause a problem. For `treeReduce`, as the n
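The difference can be illustrated with a pure-Python analogue of `treeReduce` (a sketch of the idea only, not Spark's implementation): instead of one node combining every partition result at once, results are merged pairwise in rounds, so each round halves the number of intermediate values:

```python
def tree_reduce(parts, combine):
    """Merge partition results pairwise in rounds (tree aggregation).

    parts: non-empty list of per-partition results.
    combine: associative binary function merging two results.
    """
    items = list(parts)
    while len(items) > 1:
        merged = []
        # Combine adjacent pairs; each round halves the item count,
        # so the merge depth is logarithmic in the number of partitions.
        for i in range(0, len(items) - 1, 2):
            merged.append(combine(items[i], items[i + 1]))
        if len(items) % 2 == 1:
            merged.append(items[-1])  # odd item carries to the next round
        items = merged
    return items[0]
```

With a flat `reduce`, the driver would receive and combine all partition outputs itself; the tree pattern spreads that merge work across rounds, which is why it scales better for very large vectors.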

Find the two storage locations of each partition of a replicated RDD

2015-01-23 Thread Rapelly Kartheek
Hi, I want to find the storage locations (BlockManagerIds) of each partition when the RDD is replicated twice. I mean, if a twice-replicated RDD has got 5 partitions, I would like to know the first and second storage locations of each partition. Basically, I am trying to modify the list of nodes sel