Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread Carlile, Ken
We use Spark with NFS as the data store, mainly using Dr. Jeremy Freeman’s Thunder framework. Works very well (and I see HUGE throughput on the storage system during loads). I haven’t seen (or heard from the devs/users) a need for HDFS or S3. —Ken On Aug 25, 2016, at 8:02 PM,
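
A minimal sketch of what that kind of setup looks like at submit time, assuming the NFS share is mounted at the same path on every node so plain file:// URIs work; the mount point, $MASTER variable, and script name are illustrative, not taken from the thread:

    # NFS share mounted identically on all nodes (hypothetical path)
    $SPARK_HOME/bin/spark-submit \
      --master spark://$MASTER:7077 \
      analysis.py file:///nfs/spark-data/session01/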

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
Skype: thumsupdeicool Google talk: deicool Blog: http://loveandfearless.wordpress.com Facebook: http://www.facebook.com/deicool "Contribute to the world, environment and more : http://www.gridrepublic.org " On Thu, Jun 16, 2016 at 5:56 PM, Carlile, Ken <carli...@jane

Re: Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
re : http://www.gridrepublic.org " On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote: We run Spark on a general purpose HPC cluster (using standalone mode and the HPC scheduler), and are currently on Spark 1.6.1. One of the primary users ha

Spark crashes worker nodes with multiple application starts

2016-06-16 Thread Carlile, Ken
We run Spark on a general purpose HPC cluster (using standalone mode and the HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has been testing various storage and other parameters for Spark, which involves doing multiple shuffles and shutting down and starting many

Re: spark.driver.memory meaning

2016-04-03 Thread Carlile, Ken
Cool. My users tend to interact with the driver via iPython Notebook, so clearly I’ll have to leave (fairly significant amounts of) RAM for that. But I should be able to write a one-liner into the spark-env.sh that will determine whether it’s on a 128GB or 256GB node and have it size itself
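
A hedged sketch of that sort of one-liner, expanded across a few lines for readability; reading /proc/meminfo and the particular memory values are assumptions, not the script from the thread:

    # spark-env.sh: size the driver from the node's physical RAM, leaving
    # headroom for the iPython notebook process (values are illustrative)
    MEM_GB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
    if [ "$MEM_GB" -gt 200 ]; then export SPARK_DRIVER_MEMORY=200g; else export SPARK_DRIVER_MEMORY=96g; fi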

spark.driver.memory meaning

2016-04-03 Thread Carlile, Ken
In the spark-env.sh example file, the comments indicate that the spark.driver.memory is the memory for the master in YARN mode. None of that actually makes any sense… In any case, I’m using spark in a standalone mode, running the driver on a separate machine from the master. I have a few
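
In a standalone, client-mode setup like this, the driver is the JVM behind whatever process runs spark-submit (or the iPython kernel), not the master, so spark.driver.memory sizes that process. An illustrative invocation; the value and script name are made up:

    # driver memory applies on the machine where spark-submit runs
    $SPARK_HOME/bin/spark-submit \
      --master spark://$MASTER:7077 \
      --driver-memory 64g \
      job.py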

Re: Limit pyspark.daemon threads

2016-03-26 Thread Carlile, Ken
a max of 48GB (and it goes frequently beyond that). You will be better off using a lower number there and instead increasing the parallelism of your job (i.e. dividing the job into more and smaller partitions). On Sat, Mar 26, 2016 at 7:10 AM, Carlile, Ken <carli...@janelia.hhmi.o
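
A sketch of what lowering that number while raising parallelism can look like at submit time; the figures are illustrative only:

    # cap the cores the application grabs and split the same work
    # into more, smaller partitions
    $SPARK_HOME/bin/spark-submit \
      --master spark://$MASTER:7077 \
      --total-executor-cores 64 \
      --conf spark.default.parallelism=512 \
      job.py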

Re: Limit pyspark.daemon threads

2016-03-26 Thread Carlile, Ken
normal circumstances this would leave parts of your resources underutilized). Hope this helps! -Sven On Fri, Mar 25, 2016 at 10:41 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote: Further data on this.  I’m watching another job right now where there are 16 py

Re: Limit pyspark.daemon threads

2016-03-25 Thread Carlile, Ken
own, driving the load up. I’m hoping someone has seen something like this.  —Ken On Mar 21, 2016, at 3:07 PM, Carlile, Ken <carli...@janelia.hhmi.org> wrote: No further input on this? I discovered today that the pyspark.daemon threadcount was actually 48, which makes a littl

Re: Limit pyspark.daemon threads

2016-03-21 Thread Carlile, Ken
Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken <carli...@janelia.hhmi.org> wrote: Hello, We have an HPC cluster that we run Spark jobs on using standalone mode and a number of scripts I’ve built up to dynamically schedule and start spark clusters within the Grid Engine framework. Nodes i

Re: Limit pyspark.daemon threads

2016-03-18 Thread Carlile, Ken
512m (default): Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings (e.g. 512m, 2g). If the memory used during aggregation goes above this amount, it will spill the data into disks. On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken
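
That description appears to be the spark.python.worker.memory property (an assumption based on the quoted 512m default and wording); one way to raise it, with an illustrative value:

    # assumed property name: spark.python.worker.memory
    $SPARK_HOME/bin/spark-submit \
      --conf spark.python.worker.memory=2g \
      job.py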

Limit pyspark.daemon threads

2016-03-18 Thread Carlile, Ken
Hello, We have an HPC cluster that we run Spark jobs on using standalone mode and a number of scripts I’ve built up to dynamically schedule and start spark clusters within the Grid Engine framework. Nodes in the cluster have 16 cores and 128GB of RAM. My users use pyspark heavily. We’ve
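
For context, the per-node knobs in spark-env.sh that bound what each worker offers on a 16-core / 128GB node; the numbers here are assumptions, not the cluster's actual settings:

    # spark-env.sh on each worker node (illustrative values)
    export SPARK_WORKER_CORES=12     # keep a few cores free for the OS and pyspark daemons
    export SPARK_WORKER_MEMORY=100g  # leave headroom for the python worker processes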

building spark 1.6.0 fails

2016-01-28 Thread Carlile, Ken
I am attempting to build Spark 1.6.0 from source on EL 6.3, using Oracle JDK 1.8.0_45, Python 2.7.6, and Scala 2.10.3. When I try to issue build/mvn -DskipTests clean package, I get the following: [INFO] Using zinc server for incremental compilation [info] Compiling 3 Java sources to
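
For reference, a hedged sketch of the environment and command as typically run for a 1.6.0 source build; the JDK path is hypothetical and this does not address whatever error follows the truncated output:

    export JAVA_HOME=/usr/java/jdk1.8.0_45   # hypothetical install path
    cd spark-1.6.0
    build/mvn -DskipTests clean package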

Re: Application Detail UI change

2015-12-21 Thread Carlile, Ken
I start the spark master with $SPARK_HOME/sbin/start-master.sh, but I use the following to start the workers: $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER:7077 see my blog for more details, although I need to update the posts based on what I’ve changed
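
Laid out as a short sketch, with $MASTER standing for the master's hostname as in the message:

    # on the master node
    $SPARK_HOME/sbin/start-master.sh
    # on each worker node, pointing at the master started above
    $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER:7077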