high minimum query latency

2014-06-29 Thread Toby Douglass
Gents, I've been benchmarking Presto, Spark, Impala and Redshift. I've been looking most recently at minimum query latency. In all cases, the cluster consists of eight m1.large EC2 instances. The minimal data set is a single 3.5 MB gzipped file. With Presto (backed by s3), I see 1 to 2 second

Re: high minimum query latency

2014-06-29 Thread Toby Douglass
(Spark here is using s3).

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das debasish.da...@gmail.com wrote: 600s for Spark vs 5s for Redshift...The numbers look much different from the amplab benchmark... https://amplab.cs.berkeley.edu/benchmark/ Is it like SSDs or something that's helping redshift or the whole data is

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson ilike...@gmail.com wrote: Note that regarding a long load time, data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far,

Re: Shark vs Impala

2014-06-22 Thread Toby Douglass
I've just benchmarked Spark and Impala. Same data (in s3), same query, same cluster. Impala has a long load time, since it cannot load directly from s3. I have to create a Hive table on s3, then insert from that to an Impala table. This takes a long time; Spark took about 600s for the query,

Re: initial basic question from new user

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher schum...@icsi.berkeley.edu wrote: On 06/12/2014 05:47 PM, Toby Douglass wrote: In these future jobs, when I come to load the aggregated RDD, will Spark load and only load the columns being accessed by the query? or will Spark load

spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
Gents, I have been bringing up a cluster on EC2 using the spark_ec2.py script. This works if the cluster has a single slave. This fails if the cluster has sixteen slaves, during the work to transfer the SSH key to the slaves. I cannot currently bring up a large cluster. Can anyone shed any

Re: spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6. Ah, yes - I mean to say, Amazon Linux. Have you tried either: 1. Retrying launch with the --resume option? 2. Increasing the

Re: spark EC2 bring-up problems

2014-06-12 Thread Toby Douglass
On Thu, Jun 12, 2014 at 9:10 PM, Zongheng Yang zonghen...@gmail.com wrote: Hi Toby, It is usually the case that even if the EC2 console says the nodes are up, they are not really fully initialized. For 16 nodes I have found `--wait 800` to be the norm that makes things work. It seems so!