Hey Mich,
Oh well, you know, us humble programmers try to modestly understand what the brilliant data scientists are designing, and I can assure you it is not easy. Basically, I use Spark in three ways (rough Java sketches of all three are at the end of this message):

1) As a developer

I just embed the Spark binaries (jars) in my Maven POM. In the app, when I need Spark to do something, I simply call the local master (quick example here: http://jgp.net/2016/06/26/your-very-first-apache-spark-application/).
Pro: this is the super-duper easy & lazy way; it works like a charm and sets up in under 5 minutes with one arm behind your back and blindfolded.
Con: well, I have a MacBook Air, a nice MacBook Air, but still it is only a MacBook Air, with 8GB of RAM and 2 cores... My analysis never finishes (but a subset does).

2) As a database

OK, some will probably find this shocking, but I use Spark as a database on a distant computer (my sweet Micha). The app connects to Spark, tells it what to do, and then "consumes" the data crunching done by Spark on Micha (a bit more on the architecture here: http://jgp.net/2016/07/14/chapel-hill-we-dont-have-a-problem/).
Pro: this can scale like crazy (I have benchmarks scheduled).
Con: well... after going through all the issues I had, I don't see many issues anymore (except that I still can't set the # of executors -- which starts to make sense).

3) As a remote batch processor

You prepare your "batch" as a jar and submit it to the cluster. I remember using mainframes this way (and using SAS).
Pro: very friendly to data scientists / researchers, as they are used to this batch model.
Con: you need to prepare the batch, send it... The jar also needs to do something with the results: save them in a database? Send a mail? Send a PDF? Call the police?

Do you agree? Any other opinion? I am not saying one is better than the other, just trying to get a "big picture".
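To make 1) concrete, here is a minimal sketch of the embedded/local setup (not my actual app; the class name and the toy data are made up for illustration):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EmbeddedSparkExample {
    public static void main(String[] args) {
        // Everything runs inside this JVM, against the local master
        SparkConf conf = new SparkConf()
                .setAppName("Embedded Spark example")
                .setMaster("local[2]"); // 2 cores -- all my MacBook Air has
        JavaSparkContext sc = new JavaSparkContext(conf);
        // A toy job, just to show the round trip
        long count = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
        System.out.println("count = " + count);
        sc.stop();
    }
}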
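For 2), the only real difference is the master URL: the driver stays in my app and the crunching happens on the standalone cluster. Another sketch, reusing the master address from the code further down this thread; the input path is hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteSparkExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("NC Eatery app")
                .set("spark.executor.memory", "4g")
                .setMaster("spark://10.0.100.120:7077"); // the standalone master on Micha
        JavaSparkContext sc = new JavaSparkContext(conf);
        // The executors on the cluster do the crunching; only the result
        // comes back to the driver
        JavaRDD<String> lines = sc.textFile("hdfs:///data/restaurants.csv"); // hypothetical path
        System.out.println("rows = " + lines.count());
        sc.stop();
    }
}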
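And for 3), the jar is self-contained: no setMaster() in the code, spark-submit supplies the master and the resources at launch time, and the jar itself has to decide what to do with its results. A sketch, with input and output paths passed as arguments (again, nothing here is from a real job):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class BatchJobExample {
    public static void main(String[] args) {
        // No setMaster(): spark-submit decides where this runs
        SparkConf conf = new SparkConf().setAppName("Batch job example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.textFile(args[0])               // input path, passed at submit time
          .filter(line -> !line.isEmpty()) // the actual crunching goes here
          .saveAsTextFile(args[1]);        // the jar deals with its own results
        sc.stop();
    }
}

It then gets launched with spark-submit, pretty much with the flags from your example quoted below.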
jg

> On Jul 15, 2016, at 2:13 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Interesting
>
> For some stuff I create an uber jar file and use that against spark-submit. I have not attempted to start the cluster from the application.
>
> I tend to use a shell program (actually a k-shell) to compile it via maven or sbt and then run it accordingly. In general you can parameterise everything, from runtime parameters, say --driver-memory ${DRIVER_MEMORY}, to practically any other parameter. That way I find it more flexible, because I can submit the jar file and the class in any environment and adjust those runtime parameters accordingly. There are certain advantages to using spark-submit; for example, since the driver-memory setting encapsulates the JVM, you will need to set any non-default amount of driver memory before starting the JVM, by providing the value in spark-submit.
>
> I would be keen on hearing the pros and cons of the above approach. I am sure you programmers (Scala/Java) know much more than me :)
>
> Cheers
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 15 July 2016 at 16:42, Jean Georges Perrin <j...@jgp.net> wrote:
> lol - young padawan I am and path to knowledge seeking I am...
>
> And on this path I also tried (without luck)...
>
> if (restId == 0) {
>     conf = conf.setExecutorEnv("spark.executor.cores", "22");
> } else {
>     conf = conf.setExecutorEnv("spark.executor.cores", "2");
> }
>
> and
>
> if (restId == 0) {
>     conf.setExecutorEnv("spark.executor.cores", "22");
> } else {
>     conf.setExecutorEnv("spark.executor.cores", "2");
> }
>
> The only annoying thing I see is that we designed some of the work to be handled by the driver/client app, and we will have to rethink the design of the app a bit for that...
>
>> On Jul 15, 2016, at 11:34 AM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
>>
>> Mich's invocation is for starting a Spark application against an already running Spark standalone cluster. It will not start the cluster for you.
>>
>> We used to not use "spark-submit", but we started using it when it solved some problem for us. Perhaps that day has also come for you? :)
>>
>> On Fri, Jul 15, 2016 at 5:14 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>> I don't use submit: I start my standalone cluster and connect to it remotely. Is that a bad practice?
>>
>> I'd like to be able to do it dynamically, as the system knows whether it needs more or fewer resources based on its own context.
>>
>>> On Jul 15, 2016, at 10:55 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> You can also do all this at env or submit time with spark-submit, which I believe makes it more flexible than coding it in.
>>>
>>> Example
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>   --packages com.databricks:spark-csv_2.11:1.3.0 \
>>>   --driver-memory 2G \
>>>   --num-executors 2 \
>>>   --executor-cores 3 \
>>>   --executor-memory 2G \
>>>   --master spark://50.140.197.217:7077 \
>>>   --conf "spark.scheduler.mode=FAIR" \
>>>   --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
>>>   --jars /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>>>   --class "${FILE_NAME}" \
>>>   --conf "spark.ui.port=${SP}" \
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
>>> On 15 July 2016 at 13:48, Jean Georges Perrin <j...@jgp.net> wrote:
>>> Merci Nihed, this is one of the tests I did :( still not working
>>>
>>>> On Jul 15, 2016, at 8:41 AM, nihed mbarek <nihe...@gmail.com> wrote:
>>>>
>>>> can you try with:
>>>>
>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app")
>>>>         .set("spark.executor.memory", "4g")
>>>>         .setMaster("spark://10.0.100.120:7077");
>>>> if (restId == 0) {
>>>>     conf = conf.set("spark.executor.cores", "22");
>>>> } else {
>>>>     conf = conf.set("spark.executor.cores", "2");
>>>> }
>>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>>
>>>> On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>>>> Hi,
>>>>
>>>> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
>>>>
>>>> My process uses all the cores of my server (good), but I am trying to limit it so I can actually submit a second job.
>>>>
>>>> I tried
>>>>
>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app")
>>>>         .set("spark.executor.memory", "4g")
>>>>         .setMaster("spark://10.0.100.120:7077");
>>>> if (restId == 0) {
>>>>     conf = conf.set("spark.executor.cores", "22");
>>>> } else {
>>>>     conf = conf.set("spark.executor.cores", "2");
>>>> }
>>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>>
>>>> and
>>>>
>>>> SparkConf conf = new SparkConf().setAppName("NC Eatery app")
>>>>         .set("spark.executor.memory", "4g")
>>>>         .setMaster("spark://10.0.100.120:7077");
>>>> if (restId == 0) {
>>>>     conf.set("spark.executor.cores", "22");
>>>> } else {
>>>>     conf.set("spark.executor.cores", "2");
>>>> }
>>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>>>
>>>> but it does not seem to take it. Any hint?
>>>>
>>>> jg
>>>>
>>>> --
>>>> M'BAREK Med Nihed,
>>>> Fedora Ambassador, TUNISIA, Northern Africa
>>>> http://www.nihed.com
>>>> http://tn.linkedin.com/in/nihed