high minimum query latency

2014-06-29 Thread Toby Douglass
Gents, I've been benchmarking Presto, Spark, Impala and Redshift. I've been looking most recently at minimum query latency. In all cases, the cluster consists of eight m1.large EC2 instances. The minimal data set is a single 3.5 MB gzipped file. With Presto (backed by S3), I see 1 to 2 second

Re: high minimum query latency

2014-06-29 Thread Toby Douglass
(Spark here is using S3).

Spark with HBase

2014-06-29 Thread N . Venkata Naga Ravi
I am using the following versions: spark-1.0.0-bin-hadoop2 and hbase-0.96.1.1-hadoop2. When executing the HBase test, I am facing the following exception. It looks like some version incompatibility; can you please help with it? NERAVI-M-70HY:spark-1.0.0-bin-hadoop2 neravi$ ./bin/run-example

RE: Spark with HBase

2014-06-29 Thread N . Venkata Naga Ravi
+user@spark.apache.org From: nvn_r...@hotmail.com To: u...@spark.incubator.apache.org Subject: Spark with HBase Date: Sun, 29 Jun 2014 15:28:43 +0530 I am using the following versions: spark-1.0.0-bin-hadoop2 and hbase-0.96.1.1-hadoop2. When executing the HBase test, I am facing

Kafka/ES question

2014-06-29 Thread boci
Hi! I'm trying to use Spark with Kafka; everything works, but I found a little problem. I created a small test application which connects to a real Kafka cluster, sends a message and reads it back. It works, but when I run my test a second time (send/read) it reads both the first and the second stream (maybe
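
A minimal sketch of the kind of read-back described above, using the Spark 1.0 receiver-based Kafka stream; the ZooKeeper quorum, consumer group and topic name are placeholders, and whether earlier messages are replayed on a second run depends on the consumer group id and the consumer's auto.offset.reset setting:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaReadBack {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaReadBack").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(2))
        // Placeholders for this sketch
        val zkQuorum = "zk-host:2181"
        val groupId  = "test-consumer"         // a fixed group id lets the consumer track offsets in ZooKeeper
        val topics   = Map("test-topic" -> 1)  // topic -> number of receiver threads
        // Receiver-based Kafka stream of (key, message) pairs
        val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
        messages.map(_._2).print()             // print the message payloads of each batch
        ssc.start()
        ssc.awaitTermination()
      }
    }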

Spark Streaming with HBase

2014-06-29 Thread N . Venkata Naga Ravi
Hi, Is there any example provided for Spark Streaming with input taken from HBase table content? Thanks, Ravi
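
A minimal sketch, along the lines of the batch-oriented HBase example referenced earlier in this digest, of reading an HBase table into an RDD with newAPIHadoopRDD; the table name is a placeholder, the HBase client jars are assumed to be on the classpath, and a streaming job would have to poll or wrap something like this in a custom receiver:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object HBaseRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HBaseRead").setMaster("local[2]"))
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name
        // Read the table as an RDD of (row key, row result) pairs
        val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])
        println("rows in table: " + rows.count())
        sc.stop()
      }
    }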

Memory/Network Intensive Workload

2014-06-29 Thread danilopds
Hello, I'm studying the Spark platform and I'd like to run experiments with its Spark Streaming extension, so I guess that memory- and network-intensive workloads are good options. Can anyone suggest a few typical Spark Streaming workloads that are network/memory intensive? If someone

Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Robert James
Although Spark's home page offers binaries for Spark 1.0.0 with Hadoop 2, the Maven repository only seems to have one version, which uses Hadoop 1. Is it possible to use a Maven link and Hadoop 2? What is the id? If not: How can I use the prebuilt binaries to use Hadoop 2? Do I just copy the

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread FRANK AUSTIN NOTHAFT
Robert, You can build a Spark application using Maven for Hadoop 2 by adding a dependency on the Hadoop 2.* hadoop-client package. If you define any Hadoop Input/Output formats, you may also need to depend on the hadoop-mapreduce package. Regards, Frank Austin Nothaft fnoth...@berkeley.edu

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Siyuan he
Hi Robert, I am using the following Maven command to build Spark 1.0 for Hadoop 2 + HBase 0.96.2: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package Regards, Siyuan On Sun, Jun 29, 2014 at 3:20 PM, Robert James srobertja...@gmail.com wrote: Although Spark's home

Issues starting up Spark on mesos - akka.version

2014-06-29 Thread _soumya_
I'm new to Spark and not very experienced with Scala issues. I'm facing this error message while trying to start up Spark on Mesos in a Vagrant box. vagrant@mesos:~/installs/spark-1.0.0$ java -cp rickshaw-spark-0.0.1-SNAPSHOT.jar com.evocalize.rickshaw.spark.applications.GenerateSEOContent -m

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Robert James
On 6/29/14, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote: Robert, You can build a Spark application using Maven for Hadoop 2 by adding a dependency on the Hadoop 2.* hadoop-client package. If you define any Hadoop Input/Output formats, you may also need to depend on the hadoop-mapreduce

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Frank Austin Nothaft
Hi Robert, I’m not sure about sbt; we’re currently using Maven to build. We do create a single jar though, via the Maven shade plugin. Our project has three components, and we routinely distribute the jar for our project’s CLI out across a cluster. If you’re interested, here are our project’s

Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Parsian, Mahmoud
Given the following time series data (name, time, value):
x,2,9
x,1,3
x,3,6
y,2,5
y,1,7
y,3,1
z,3,7
z,4,0
z,1,4
z,2,8
we want to generate the following (the reduced/grouped values are sorted by time):
x = [(1,3), (2,9), (3,6)]
y = [(1,7), (2,5), (3,1)]
z = [(1,4), (2,8), (3,7), (4,0)]
One

Re: Selecting first ten values in an RDD/partition

2014-06-29 Thread Chris Fregly
As Brian G alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. Once you collect the results from all the partitions, you can do a global top-10 merge sort across them. This leads to a much smaller dataset to be shuffled
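
A minimal sketch of this two-step approach on a plain RDD (the thread mentions DStream.mapPartitions(), but the same idea applies to each batch); the input data and partition count are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object TopTenPerPartition {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TopTen").setMaster("local[2]"))
        val values = sc.parallelize(1 to 1000000, 32) // placeholder data spread over 32 partitions
        // Step 1: partition-local top 10, computed before anything leaves the partition
        val localTops = values.mapPartitions { iter =>
          iter.toSeq.sortBy(x => -x).take(10).iterator
        }
        // Step 2: collect the small per-partition results and merge them into a global top 10
        val globalTop10 = localTops.collect().sortBy(x => -x).take(10)
        println(globalTop10.mkString(", "))
        sc.stop()
      }
    }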

RE: Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Shao, Saisai
Hi Mahmoud, I don't think you can achieve this in the current Spark framework, because Spark's shuffle is hash-based, unlike MapReduce's sort-based shuffle, so you have to implement the sorting explicitly using RDD operators. Thanks, Jerry From: Parsian, Mahmoud
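
A minimal sketch of doing the sort explicitly with RDD operators, using the sample records from the original question; note that each key's values are sorted in memory after groupByKey, which is exactly the limitation discussed later in the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object SortGroupedValues {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SortGroupedValues").setMaster("local[2]"))
        // The (name, time, value) records from the original question
        val data = sc.parallelize(Seq(
          ("x", 2, 9), ("x", 1, 3), ("x", 3, 6),
          ("y", 2, 5), ("y", 1, 7), ("y", 3, 1),
          ("z", 3, 7), ("z", 4, 0), ("z", 1, 4), ("z", 2, 8)))
        // Group by name, then sort each group's (time, value) pairs by time.
        // The sort runs after the shuffle, so all values for one key must fit in memory.
        val grouped = data.map { case (name, time, value) => (name, (time, value)) }
          .groupByKey()
          .mapValues(_.toList.sortBy(_._1))
        grouped.collect().foreach { case (name, values) =>
          println(name + " = " + values.mkString("[", ", ", "]"))
        }
        sc.stop()
      }
    }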

questions about shuffle time and parallel degree

2014-06-29 Thread wxhsdp
Hi all, I have two questions about shuffle time and degree of parallelism. Question 1: assume the cluster size is fixed, for example a cluster of 16 nodes in EC2, each node with 2 cores. Case 1: a total shuffle of 64 GB of data between 32 partitions. Case 2: a total shuffle of 128 GB of data between

RE: Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Parsian, Mahmoud
Hi Jerry, Thank you for replying to my question. If indeed Spark does not have a framework-level secondary sort, then that is a limitation. There might be cases where you have more values per key than can be handled in a commodity server's memory (I mean sorting the values in RAM). If we had a

RE: Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Shao, Saisai
Yes, the current implementation has this memory limitation. The community has already noticed the problem and there is a patch to solve it (PR931: https://github.com/apache/spark/pull/931); you can click through to see the details. Also, as you said, current Spark cannot guarantee the order of

Re: Reconnect to an application/RDD

2014-06-29 Thread Chris Fregly
Tachyon is another option - this is the off-heap StorageLevel specified when persisting RDDs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel. Or just use HDFS; this requires subsequent applications/SparkContexts to reload the data from disk, of
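
A minimal sketch of both options; the HDFS path is a placeholder, and the OFF_HEAP level assumes a Tachyon store is configured for the cluster (Spark 1.0):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object KeepResultsAround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("KeepResultsAround").setMaster("local[2]"))
        val rdd = sc.parallelize(1 to 1000)
        // Option 1: off-heap storage (backed by Tachyon in Spark 1.0); requires a Tachyon store to be configured
        rdd.persist(StorageLevel.OFF_HEAP)
        // Option 2: write to HDFS so a later application / SparkContext can reload the data from disk
        rdd.saveAsObjectFile("hdfs:///tmp/shared-rdd") // placeholder path
        val reloaded = sc.objectFile[Int]("hdfs:///tmp/shared-rdd")
        println(reloaded.count())
        sc.stop()
      }
    }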

RE: Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Parsian, Mahmoud
Jerry, thank you very much for the further clarifications! Best, Mahmoud From: Shao, Saisai [saisai.s...@intel.com] Sent: Sunday, June 29, 2014 8:17 PM To: user@spark.apache.org Subject: RE: Sorting Reduced/Grouped Values without Explicit Sorting Yes, the current

Re: Could not compute split, block not found

2014-06-29 Thread Bill Jay
Tobias, Thanks for your help. I think in my case the batch size is 1 minute. However, it takes my program more than 1 minute to process 1 minute's data. I am not sure whether this is because unprocessed data piles up. Do you have a suggestion on how to check and solve it? Thanks! Bill On