We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3.
—Ken
On Aug 25, 2016, at 8:02 PM,
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool
"Contribute to the world, environment and more :
http://www.gridrepublic.org
"
On Thu, Jun 16, 2016 at 5:56 PM, Carlile, Ken
<carli...@jane
On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken
<carli...@janelia.hhmi.org> wrote:
We run Spark on a general purpose HPC cluster (using standalone mode and the
HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has
been testing various storage and other parameters for Spark, which involves
doing multiple shuffles and shutting down and starting many
Cool. My users tend to interact with the driver via iPython Notebook, so clearly I'll have to leave (fairly significant amounts of) RAM for that. But I should be able to write a one-liner into spark-env.sh that determines whether it's on a 128GB or 256GB node and has it size itself accordingly.
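A sketch of what that spark-env.sh one-liner could look like, reading MemTotal from /proc/meminfo (Linux); the 192GB cutoff and the 224g/96g worker sizes are illustrative assumptions, not values from this thread:

```shell
# Illustrative spark-env.sh fragment: size the worker from the node's RAM.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
if [ "$total_kb" -gt $((192 * 1024 * 1024)) ]; then
  export SPARK_WORKER_MEMORY=224g   # 256GB node: leave headroom for the notebook driver
else
  export SPARK_WORKER_MEMORY=96g    # 128GB node
fi
```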
In the spark-env.sh example file, the comments indicate that spark.driver.memory is the memory for the master in YARN mode. None of that actually makes any sense…
In any case, I’m using spark in a standalone mode, running the driver on a
separate machine from the master. I have a few
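For reference, in standalone client mode the driver JVM heap is what spark.driver.memory controls (it can also be passed as --driver-memory to spark-submit, since the driver JVM starts before SparkConf is read); a spark-defaults.conf sketch with an illustrative value:

```
spark.driver.memory   8g
```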
a max of 48GB (and it frequently goes beyond that). You will be better off using a lower number there and instead increasing the parallelism of your job (i.e., dividing the job into more and smaller partitions).
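The "more and smaller partitions" arithmetic can be sketched in plain Python; the 128MB target partition size is an illustrative choice, not a figure from this thread:

```python
def partitions_for(total_bytes, target_partition_bytes=128 * 1024**2):
    """How many partitions are needed so each holds at most the target size."""
    return max(1, -(-total_bytes // target_partition_bytes))  # ceiling division

# A 48GB dataset at ~128MB per partition:
n = partitions_for(48 * 1024**3)
print(n)  # 384 -- in pyspark this would feed rdd.repartition(n)
```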
On Sat, Mar 26, 2016 at 7:10 AM, Carlile, Ken <carli...@janelia.hhmi.o
normal circumstances this would leave parts of your resources underutilized).
Hope this helps!
-Sven
On Fri, Mar 25, 2016 at 10:41 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
Further data on this.
I’m watching another job right now where there are 16 py
own, driving the load up. I’m hoping someone has seen something like this.
—Ken
On Mar 21, 2016, at 3:07 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote:
No further input on this? I discovered today that the pyspark.daemon thread count was actually 48, which makes a littl
spark.python.worker.memory (default: 512m)
Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
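Putting that default together with the daemon count observed earlier in the thread gives a rough sense of scale — simple arithmetic, not a measured figure:

```python
# 48 pyspark.daemon workers (the count observed above), each allowed the
# 512m spark.python.worker.memory default before spilling: how much RAM
# could the python workers alone claim?
workers = 48
per_worker_mb = 512
total_gb = workers * per_worker_mb / 1024
print(total_gb)  # 24.0 GB, before any JVM heap is even counted
```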
On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken
Hello,
We have an HPC cluster that we run Spark jobs on using standalone mode and a
number of scripts I’ve built up to dynamically schedule and start spark
clusters within the Grid Engine framework. Nodes in the cluster have 16 cores
and 128GB of RAM.
My users use pyspark heavily. We’ve
I am attempting to build Spark 1.6.0 from source on EL 6.3, using Oracle JDK
1.8.0_45, Python 2.7.6, and Scala 2.10.3. When I try to issue build/mvn
-DskipTests clean package, I get the following:
[INFO] Using zinc server for incremental compilation
[info] Compiling 3 Java sources to
I start the Spark master with $SPARK_HOME/sbin/start-master.sh, but I use the
following to start the workers:
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER:7077
See my blog for more details, although I need to update the posts based on what I've changed.