Spark performance gains for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where col

Re: spark 1.1.0 (w/ hadoop 2.4) vs aws java sdk 1.7.2

2015-01-22 Thread William-Smith
I have had the same issue while using HttpClient from AWS EMR Spark Streaming to post to a Node.js server. I have found ... using ClassLoader.getResource("org/apache/http/client/HttpClient") that the class is being loaded from the spark-assembly-1.1.0-hadoop2.4.0.jar. That in itself is not t
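
The check described in the message above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the class name `WhichJar` and the helper `locate` are hypothetical, and the sketch assumes the lookup uses the thread context class loader with the resource path ending in `.class`.

```java
import java.net.URL;

// Hypothetical helper (not from the original thread): reports where a class
// would be loaded from on the current classpath.
class WhichJar {
    // Returns the URL of the .class resource for the given class name,
    // or null if no class loader in the chain can find it.
    static URL locate(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resource);
    }

    public static void main(String[] args) {
        // Default to the class discussed in the thread; pass another name to check it.
        String name = args.length > 0 ? args[0] : "org.apache.http.client.HttpClient";
        URL url = locate(name);
        System.out.println(name + " -> " + (url == null ? "not on classpath" : url));
    }
}
```

When the class comes from a jar, the printed URL includes the jar path, which is how a conflicting copy bundled inside an assembly jar would show up.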

Re: query planner design doc?

2015-01-22 Thread Michael Armbrust
Here is the initial design document for Catalyst: https://docs.google.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit Strategies (many of which are in SparkStrategies.scala) are the part that creates the physical operators from a Catalyst logical plan. These operators have execu

query planner design doc?

2015-01-22 Thread Nicholas Murphy
Hi, Quick question: is there a design doc (or something more than “look at the code”) for the query planner for Spark SQL (i.e., the component that takes…Catalyst?…operator trees and translates them into Spark operations)? Thanks, Nick

Re: Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Dean Wampler
Interesting. I was wondering recently if anyone has explored working with compressed data directly. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler

Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
Thanks, Xiangrui Meng, I will try this. And, found this https://github.com/kaushikranjan/knnJoin also. Will this work with double data? Can we find out the z value of *Vector(10.3,4.5,3,5)*? On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng wrote: > For large datasets, you need hashing in order to

Are there any plans to run Spark on top of Succinct

2015-01-22 Thread Mick Davies
http://succinct.cs.berkeley.edu/wp/wordpress/ Looks like a really interesting piece of work that could dovetail well with Spark. I have been trying recently to optimize some queries I have running on Spark on top of Parquet, but the support from Parquet for predicate pushdown, especially for dict