You should look at using Mesos. This should abstract away the individual
hosts into a pool of resources and make the different physical
specifications manageable.
I haven't tried configuring Spark Standalone mode to have different specs
on different machines, but based on spark-env.sh.template:
I don't see anything that says you must explicitly restart them to load the
new settings, but usually there is some sort of signal trapped [or brute
force full restart] to get a configuration reload for most daemons. I'd
take a guess and use the $SPARK_HOME/sbin/{stop,start}-slaves.sh scripts on
If you want to design something like Spark shell have a look at:
http://zeppelin-project.org/
It's open source and may already do what you need. If not, its source code
will be helpful in answering the questions about how to integrate with long
running jobs that you have.
On Thu Feb 05 2015 at
I've been doing a bunch of work with CSVs in Spark, mostly saving them as a
merged CSV (instead of the various part-n files). You might find the
following links useful:
- This article is about combining the part files and outputting a header as
the first line in the merged results:
Good questions, some of which I'd like to know the answer to.
Is it okay to update a NoSQL DB with aggregated counts per batch
interval or is it generally stored in hdfs?
This depends on how you are going to use the aggregate data.
1. Is there a lot of data? If so, and you are going to use
Did you restart the slaves so they would read the settings? You don't need
to start/stop the EC2 cluster, just the slaves. From the master node:
$SPARK_HOME/sbin/stop-slaves.sh
$SPARK_HOME/sbin/start-slaves.sh
($SPARK_HOME is probably /root/spark)
On Fri Feb 06 2015 at 10:31:18 AM Joe Wass
In case anyone needs to merge all of their part-n files (small result
set only) into a single *.csv file or needs to generically flatten case
classes, tuples, etc., into comma separated values:
http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/
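The same merge-with-header idea can be sketched locally in plain Python (the function name `merge_part_files` is mine, not from the linked article; on HDFS you would reach for `hdfs dfs -getmerge` or `FileUtil.copyMerge` instead):

```python
import glob
import os

def merge_part_files(parts_dir, out_path, header):
    """Concatenate Spark part-n output files into one CSV, writing a
    single header line first. Local-filesystem sketch only."""
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(out_path, "w") as out:
        out.write(header + "\n")
        for part in part_files:
            with open(part) as f:
                for line in f:
                    # Normalize a possibly missing trailing newline
                    out.write(line if line.endswith("\n") else line + "\n")
    return out_path
```

Sorting the glob matters: part-00000, part-00001, ... must be concatenated in order to preserve the RDD's output ordering.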
On Tue Feb 03 2015 at 8:23:59 AM
A central location, such as NFS?
If they are temporary for the purpose of further job processing you'll want
to keep them local to the node in the cluster, i.e., in /tmp. If they are
centralized you won't be able to take advantage of data locality, and the
central file store will become a bottleneck if you go that route,
since data locality is the performance advantage Spark has over vanilla Hadoop.
On Wed Feb 11 2015 at 2:10:36 PM Tassilo Klein tjkl...@gmail.com wrote:
Thanks for the info. The file system in use is a Lustre file system.
Best,
Tassilo
On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke charles.fed
file that resides on HDFS, so that it will be available to my
driver program wherever that program runs.
--
Emre
On Mon, Feb 16, 2015 at 4:41 PM, Charles Feduke charles.fed...@gmail.com
wrote:
I haven't actually tried mixing non-Spark settings into the Spark
properties. Instead I package
I cannot comment about the correctness of Python code. I will assume your
caper_kv is keyed on something that uniquely identifies all the rows that
make up the person's record so your group by key makes sense, as does the
map. (I will also assume all of the rows that comprise a single person's
there's much to be gained in moving the data from MySQL to Spark
first.
I have yet to find any non-trivial examples of ETL logic on the web ...
it seems like it's mostly word count map-reduce replacements.
On 02/16/2015 01:32 PM, Charles Feduke wrote:
I cannot comment about the correctness
I haven't actually tried mixing non-Spark settings into the Spark
properties. Instead I package my properties into the jar and use the
Typesafe Config[1] - v1.2.1 - library (along with Ficus[2] - Scala
specific) to get at my properties:
Properties file: src/main/resources/integration.conf
(below
Absolute path means no ~ and also verify that you have the path to the file
correct. For some reason the Python code does not validate that the file
exists and will hang (this is the same reason why ~ hangs).
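Since the script won't validate the path for you, it's worth checking it yourself before launching. A minimal sketch (the helper name `resolve_identity_file` is mine, not part of spark-ec2):

```python
import os

def resolve_identity_file(path):
    """Expand ~ and verify the file exists before handing the path to a
    tool (like spark-ec2) that hangs silently on a bad identity-file path."""
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full):
        raise FileNotFoundError("identity file not found: %s" % full)
    return full
```

Failing fast with a clear error beats waiting through a ten-minute silent hang.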
On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick pzybr...@gmail.com wrote:
Try using an
I deal with problems like this so often across Java applications with large
dependency trees. Add the shell function at the following link to your
shell on the machine where your Spark Streaming is installed:
https://gist.github.com/cfeduke/fe63b12ab07f87e76b38
Then run in the directory where
as
bash.
Nick
On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke charles.fed...@gmail.com
wrote:
It was only hanging when I specified the path with ~; I never tried
relative.
Hanging on the "waiting for ssh to be ready on all hosts" step. I let it sit
for about 10 minutes then I found
that for Spark 1.2.0
https://issues.apache.org/jira/browse/SPARK-4137. Maybe there’s some
case that we missed?
Nick
On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke
charles.fed...@gmail.com wrote:
Absolute path means no ~ and also verify that you have the path to the
file correct. For some reason
This is what I get:
./bigcontent-1.0-SNAPSHOT.jar:org/apache/http/impl/conn/SchemeRegistryFactory.class
(probably because I'm using a self-contained JAR).
In other words, I'm still stuck.
--
Emre
On Wed, Jan 28, 2015 at 2:47 PM, Charles Feduke charles.fed...@gmail.com
wrote:
I
You'll still need to:
import org.apache.spark.SparkContext._
Importing org.apache.spark._ does _not_ recurse into sub-objects or
sub-packages, it only brings in whatever is at the level of the package or
object imported.
SparkContext._ has some implicits, one of them for adding groupByKey to an
Are you using the default Java object serialization, or have you tried Kryo
yet? If you haven't tried Kryo please do and let me know how much it
impacts the serialization size. (I know it's more efficient, I'm curious to
know how much more efficient, and I'm being lazy - I don't have ~6K 500MB
Define "not working." Not compiling? If so you need:
import org.apache.spark.SparkContext._
On Fri Jan 30 2015 at 3:21:45 PM Amit Behera amit.bd...@gmail.com wrote:
hi all,
my sbt file is like this:
name := "Spark"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies +=
I'm trying to figure out the best approach to getting sharded data from
PostgreSQL into Spark.
Our production PGSQL cluster has 12 shards with TiB of data on each shard.
(I won't be accessing all of the data on a shard at once, but I don't think
it's feasible to use Sqoop to copy tables whose data
.)
Because of the sub-range bucketing and cluster distribution you shouldn't
run into OOM errors, assuming you provision sufficient worker nodes in the
cluster.
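The sub-range bucketing itself is simple arithmetic: split the table's key range into bounded slices so each partition issues a query like WHERE id BETWEEN lo AND hi. A sketch (function name `id_buckets` is mine; it assumes a dense-ish integer key):

```python
def id_buckets(min_id, max_id, num_buckets):
    """Split an inclusive [min_id, max_id] key range into roughly equal
    sub-ranges, one per partition, so each worker pulls a bounded slice
    of the table instead of one huge read that could OOM."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_buckets)
    buckets, lo = [], min_id
    for i in range(num_buckets):
        # Spread the remainder across the first `extra` buckets
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        buckets.append((lo, lo + size - 1))
        lo += size
    return buckets
```

Each (lo, hi) pair can then be parallelized across the cluster and mapped to one bounded query per worker.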
On Sun Jan 25 2015 at 9:39:56 AM Charles Feduke charles.fed...@gmail.com
wrote:
I'm facing a similar problem except my data is already
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts.
I downloaded Spark 1.2 with prebuilt for Hadoop 2.4 and later.
What parameters are you using when you execute spark-ec2?
I am launching in the us-west-1 region (ami-7a320f3f) which may explain
things.
On Mon Jan 26 2015
I think you want to instead use `.saveAsSequenceFile` to save an RDD to
someplace like HDFS or NFS if you are attempting to interoperate with
another system, such as Hadoop. `.persist` is for keeping the contents of
an RDD around so future uses of that particular RDD don't need to
recalculate its
I'm facing a similar problem except my data is already pre-sharded in
PostgreSQL.
I'm going to attempt to solve it like this:
- Submit the shard names (database names) across the Spark cluster as a
text file and partition it so workers get 0 or more - hopefully 1 - shard
name. In this case you
I have been trying to work around a similar problem with my Typesafe config
*.conf files seemingly not appearing on the executors. (Though now that I
think about it, it's not because the files are absent in the JAR, but because
the -Dconf.resource system property I pass to the master obviously
1.2.0 binary
(pre-built for Hadoop 2.4 and later).
Or maybe I'm totally wrong, and the problem / fix is something completely
different?
--
Emre
On Wed, Jan 28, 2015 at 4:58 PM, Charles Feduke charles.fed...@gmail.com
wrote:
It looks like you're shading in the Apache HTTP commons library
Assuming you are on Linux, what is your /etc/security/limits.conf set for
nofile/soft (number of open file handles)?
On Fri, Mar 20, 2015 at 3:29 PM Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
I tried to run a simple sort-by on 1.2.1, and it always gives me the two
errors below:
errors:
1,
What I found from a quick search of the Spark source code (from my local
snapshot on January 25, 2015):
// Interval between each check for event log updates
private val UPDATE_INTERVAL_MS =
  conf.getInt("spark.history.fs.updateInterval",
    conf.getInt("spark.history.updateInterval", 10)) * 1000
This should help you understand the cost of running a Spark cluster for a
short period of time:
http://www.ec2instances.info/
If you run an instance for even 1 second of a single hour you are charged
for that complete hour. So before you shut down your miniature cluster make
sure you really are
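The billing arithmetic is easy to check yourself: classic EC2 rounds each instance's runtime up to a whole hour. A sketch (the function `ec2_cost` and its rates are illustrative, not an AWS API):

```python
import math

def ec2_cost(runtime_seconds, hourly_rate, num_instances):
    """Classic EC2 hourly billing: each instance's runtime is rounded UP
    to a whole hour, so a 1-second run still costs a full hour apiece."""
    billed_hours = max(1, math.ceil(runtime_seconds / 3600.0))
    return billed_hours * hourly_rate * num_instances
```

So a five-node test cluster torn down after one second still bills five full instance-hours.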
Scala is the language used to write Spark so there's never a situation in
which features introduced in a newer version of Spark cannot be taken
advantage of if you write your code in Scala. (This is mostly true of Java,
but it may be a little more legwork if a Java-friendly adapter isn't
available
How long does each executor keep the connection open for? How many
connections does each executor open?
Are you certain that connection pooling is a performant and suitable
solution? Are you running out of resources on the database server and
cannot tolerate each executor having a single
across jobs
On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com
wrote:
How long does each executor keep the connection open for? How many
connections does each executor open?
Are you certain that connection pooling is a performant and suitable
solution? Are you running
You could also try setting your `nofile` value in /etc/security/limits.conf
for `soft` to some ridiculously high value if you haven't done so already.
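For reference, the limits.conf entries look like this (the value below is illustrative; pick one suited to your workload, and note the soft limit cannot exceed the hard limit):

# /etc/security/limits.conf
*    soft    nofile    1000000
*    hard    nofile    1000000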
On Fri, Apr 3, 2015 at 2:09 AM Akhil Das ak...@sigmoidanalytics.com wrote:
Did you try these?
- Disable shuffle : spark.shuffle.spill=false
-
As Akhil says, Ubuntu is a good choice if you're starting from near scratch.
Cloudera CDH virtual machine images[1] include Hadoop, HDFS, Spark, and
other big data tools so you can get a cluster running with very little
effort. Keep in mind Cloudera is a for-profit corporation so they are also