Gents,
I've been benchmarking Presto, Spark, Impala and Redshift.
I've been looking most recently at minimum query latency.
In all cases, the cluster consists of eight m1.large EC2 instances.
The minimal data set is a single 3.5 MB gzipped file.
With Presto (backed by s3), I see 1 to 2 second
(Spark here is using s3).
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das debasish.da...@gmail.com
wrote:
600s for Spark vs 5s for Redshift... The numbers look very different from
the AMPLab benchmark:
https://amplab.cs.berkeley.edu/benchmark/
Is it like SSDs or something that's helping redshift or the whole data is
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson ilike...@gmail.com wrote:
Note that regarding a long load time, data format means a whole lot in
terms of query performance. If you load all your data into compressed,
columnar Parquet files on local hardware, Spark SQL would also perform far,
I've just benchmarked Spark and Impala. Same data (in s3), same query,
same cluster.
Impala has a long load time, since it cannot load directly from s3. I have
to create a Hive table on s3, then insert from that to an Impala table.
This takes a long time; Spark took about 600s for the query,
On Thu, Jun 12, 2014 at 4:48 PM, Andre Schumacher
schum...@icsi.berkeley.edu wrote:
On 06/12/2014 05:47 PM, Toby Douglass wrote:
In these future jobs, when I come to load the aggregated RDD, will Spark
load only the columns being accessed by the query? Or will Spark load
Gents,
I have been bringing up a cluster on EC2 using the spark_ec2.py script.
This works if the cluster has a single slave.
This fails if the cluster has sixteen slaves, during the work to transfer
the SSH key to the slaves. I cannot currently bring up a large cluster.
Can anyone shed any
On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6
Ah, yes - I meant to say, Amazon Linux.
Have you tried either:
1. Retrying launch with the --resume option?
2. Increasing the
On Thu, Jun 12, 2014 at 9:10 PM, Zongheng Yang zonghen...@gmail.com wrote:
Hi Toby,
It is usually the case that even if the EC2 console says the nodes are
up, they are not really fully initialized. For 16 nodes I have found
`--wait 800` to be the norm that makes things work.
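For anyone scripting this, here is a rough Python sketch of the two suggestions in this thread: launch with a longer `--wait`, and retry with `--resume` if the first attempt fails. The helper names, the `./spark-ec2` script path, the retry count and the placeholder key and cluster names are all my own assumptions for illustration. Only the `--wait`, `--resume`, `-k`, `-i` and `-s` options are taken from spark-ec2 itself, so check them against the version you are running.

```python
import subprocess


def build_launch_command(cluster_name, slaves, key_pair, identity_file,
                         wait_seconds=800, resume=False,
                         script="./spark-ec2"):
    """Build the argument list for a spark-ec2 launch.

    `script` is an assumed path; point it at wherever spark-ec2 lives.
    """
    cmd = [script,
           "-k", key_pair,
           "-i", identity_file,
           "-s", str(slaves),
           "--wait", str(wait_seconds)]
    if resume:
        # Reuse the instances from an earlier, partly failed launch.
        cmd.append("--resume")
    cmd += ["launch", cluster_name]
    return cmd


def launch_with_resume(cluster_name, slaves, key_pair, identity_file,
                       wait_seconds=800, max_attempts=3,
                       run=subprocess.call):
    """Run the launch, retrying with --resume after a non-zero exit.

    Returns the number of attempts used. `run` is injectable so the
    logic can be exercised without touching EC2.
    """
    for attempt in range(max_attempts):
        cmd = build_launch_command(cluster_name, slaves, key_pair,
                                   identity_file,
                                   wait_seconds=wait_seconds,
                                   resume=(attempt > 0))
        if run(cmd) == 0:
            return attempt + 1
    raise RuntimeError("launch failed after %d attempts" % max_attempts)
```

Usage would be something like `launch_with_resume("my-cluster", 16, "my-keypair", "my-keypair.pem")`, where every argument is a placeholder. The first attempt is a plain launch, and only later attempts add `--resume`, so a clean first run behaves exactly like the manual command.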
It seems so!