Gents,
I've been benchmarking Presto, Spark, Impala and Redshift.
I've been looking most recently at minimum query latency.
In all cases, the cluster consists of eight m1.large EC2 instances.
The minimal data set is a single 3.5 MB gzipped file.
With Presto (backed by S3), I see 1 to 2 second latencies
(Spark here is also using S3).
I am using the following versions:
spark-1.0.0-bin-hadoop2
hbase-0.96.1.1-hadoop2
When executing the HBase test, I am facing the following exception. It looks like some
version incompatibility; can you please help with it?
NERAVI-M-70HY:spark-1.0.0-bin-hadoop2 neravi$ ./bin/run-example
+user@spark.apache.org
From: nvn_r...@hotmail.com
To: u...@spark.incubator.apache.org
Subject: Spark with HBase
Date: Sun, 29 Jun 2014 15:28:43 +0530
I am using the following versions:
spark-1.0.0-bin-hadoop2
hbase-0.96.1.1-hadoop2
When executing the HBase test, I am facing
Hi!
I'm trying to use Spark with Kafka, and everything works, but I found a little
problem. I created a small test application which connects to a real Kafka
cluster, sends a message, and reads it back. It works, but when I run my test
a second time (send/read), it reads both the first and the second stream (maybe
Hi,
Is there any example of Spark Streaming with input provided from
HBase table content?
Thanks,
Ravi
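Not from the thread, but a minimal sketch of how one might load an HBase table's contents into an RDD via the standard TableInputFormat; the table name "my_table" and the SparkContext sc are assumptions here, and the resulting RDD could then feed a Spark Streaming job:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Hypothetical sketch: "my_table" and sc (an existing SparkContext) are assumptions.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Read the table as (row key, row result) pairs.
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"rows: ${hbaseRDD.count()}")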
Hello,
I'm studying the Spark platform and I'd like to run experiments with your
Spark Streaming extension.
I guess that memory- and network-intensive workloads are good options.
Can anyone suggest a few typical Spark Streaming workloads that are
network/memory intensive?
If someone
Although Spark's home page offers binaries for Spark 1.0.0 with Hadoop
2, the Maven repository only seems to have one version, which uses
Hadoop 1.
Is it possible to use a Maven dependency with Hadoop 2? If so, what is the artifact ID?
If not, how can I use the prebuilt binaries with Hadoop 2? Do I just
copy the
Robert,
You can build a Spark application using Maven for Hadoop 2 by adding a
dependency on the Hadoop 2.* hadoop-client package. If you define any
Hadoop Input/Output formats, you may also need to depend on the
hadoop-mapreduce package.
Regards,
Frank Austin Nothaft
fnoth...@berkeley.edu
Hi Robert,
I am using the following Maven command to build Spark 1.0 for Hadoop 2 +
HBase 0.96.2:
mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package
Regards,
siyuan
On Sun, Jun 29, 2014 at 3:20 PM, Robert James srobertja...@gmail.com
wrote:
Although Spark's home
I'm new to Spark and not very experienced with Scala issues. I'm facing this
error message while trying to start up Spark on Mesos on a Vagrant box.
vagrant@mesos:~/installs/spark-1.0.0$ java -cp
rickshaw-spark-0.0.1-SNAPSHOT.jar
com.evocalize.rickshaw.spark.applications.GenerateSEOContent -m
On 6/29/14, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote:
Robert,
You can build a Spark application using Maven for Hadoop 2 by adding a
dependency on the Hadoop 2.* hadoop-client package. If you define any
Hadoop Input/Output formats, you may also need to depend on the
hadoop-mapreduce
Hi Robert,
I’m not sure about sbt; we’re currently using Maven to build. We do create a
single jar though, via the Maven shade plugin. Our project has three
components, and we routinely distribute the jar for our project’s CLI out
across a cluster. If you’re interested, here are our project’s
Given the following time series data:
name, time, value
x,2,9
x,1,3
x,3,6
y,2,5
y,1,7
y,3,1
z,3,7
z,4,0
z,1,4
z,2,8
we want to generate the following (the reduced/grouped values are sorted by
time).
x = [(1,3), (2,9), (3,6)]
y = [(1,7), (2,5), (3,1)]
z = [(1,4), (2,8), (3,7), (4,0)]
One
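Not from the thread, but a minimal sketch of one way to produce the grouped, time-sorted output above, assuming the data is available as an RDD[(String, Int, Int)] of (name, time, value) tuples named records:

// Group by name, then sort each group's (time, value) pairs by time in memory.
val grouped = records
  .map { case (name, time, value) => (name, (time, value)) }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._1))

grouped.collect().foreach { case (name, pairs) =>
  println(s"$name = ${pairs.mkString("[", ", ", "]")}")
}

Note that this sorts each key's values in memory on a single executor, which is the limitation discussed later in the thread.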
As Brian G alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition. Once you collect
the results from all the partitions, you can do a global top-10 merge sort
across all partitions.
This leads to a much smaller dataset being shuffled
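A minimal sketch (not from the thread) of the same idea on a plain RDD of integers named values; in a streaming job the per-partition step would sit inside DStream.mapPartitions():

// Partition-local top 10: each partition emits at most one small list.
val localTop10 = values.mapPartitions { iter =>
  Iterator(iter.toArray.sortBy(v => -v).take(10).toList)
}

// Global merge: collect the small per-partition lists and take the overall top 10.
val globalTop10 = localTop10.collect().flatten.sortBy(v => -v).take(10)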
Hi Mahmoud,
I think you cannot achieve this in the current Spark framework, because
Spark's shuffle is hash-based, unlike MapReduce's sort-based shuffle,
so you would need to implement the sorting explicitly using RDD
operators.
Thanks
Jerry
From: Parsian, Mahmoud
Hi all,
I have two questions about shuffle time and degree of parallelism.
Question 1:
Assume the cluster size is fixed, for example a cluster of 16 EC2 nodes,
each node with 2 cores.
Case 1: a total shuffle of 64 GB of data across 32 partitions
Case 2: a total shuffle of 128 GB of data across
Hi Jerry,
Thank you for replying to my question. If indeed Spark does not provide
secondary sort in the framework, then that is a limitation. There might be cases
where you have more values per key than can be handled in a commodity
server's memory (I mean sorting the values in RAM).
If we had a
Yes, the current implementation has this memory limitation. The community
has already noticed the problem and there is a patch to solve it
(PR 931: https://github.com/apache/spark/pull/931); you can click through to see the
details.
Also, as you said, current Spark cannot guarantee the order of
Tachyon is another option - this is the off-heap StorageLevel specified
when persisting RDDs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel
Or just use HDFS. This requires subsequent applications/SparkContexts to
reload the data from disk, of
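A minimal sketch (not from the thread) of the off-heap persist mentioned above, assuming an existing RDD named rdd and a Tachyon-backed off-heap store configured for the cluster:

import org.apache.spark.storage.StorageLevel

// Persist off-heap (Tachyon-backed in this era of Spark); rdd is an assumption.
rdd.persist(StorageLevel.OFF_HEAP)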
Jerry, thank you very much for further clarifications!
best,
Mahmoud
From: Shao, Saisai [saisai.s...@intel.com]
Sent: Sunday, June 29, 2014 8:17 PM
To: user@spark.apache.org
Subject: RE: Sorting Reduced/Groupd Values without Explicit Sorting
Yes, the current
Tobias,
Thanks for your help. In my case, the batch size is 1 minute.
However, it takes my program more than 1 minute to process 1 minute's worth of data.
I am not sure whether that is because unprocessed data are piling up. Do you
have any suggestions on how to check this and solve it? Thanks!
Bill
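Not from the thread, but if processing consistently takes longer than the batch interval, one common mitigation is to widen the interval (or reduce per-batch work); a minimal sketch, with placeholder names, of creating a StreamingContext with a 2-minute batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder app name; a wider batch interval gives each batch more time to finish.
val conf = new SparkConf().setAppName("example-streaming-app")
val ssc = new StreamingContext(conf, Seconds(120))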
On