RE: Spark on Kubernetes - log4j.properties not read

2019-06-11 Thread Dave Jaffe
That did the trick, Abhishek! Thanks for the explanation, that answered a lot
of questions I had.

Dave






Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Dave Jaffe
I am using Spark on Kubernetes from the Spark 2.4.3 distribution. I have created a 
log4j.properties file in my local spark/conf directory and modified it so that 
the console (or, in the case of Kubernetes, the log) only shows warnings and 
higher (log4j.rootCategory=WARN, console). I then added the command
COPY conf /opt/spark/conf
to /root/spark/kubernetes/dockerfiles/spark/Dockerfile and built a new 
container image.
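
For reference, the only change is the root category line; the rest of the file is 
roughly the stock conf/log4j.properties.template console appender:

log4j.rootCategory=WARN, console
# console appender as in the stock template
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n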

However, when I run under Kubernetes, the program runs successfully but 
/opt/spark/conf/log4j.properties is not used (I still see the INFO lines when I 
run kubectl logs).

I have tried other things, such as explicitly adding a --properties-file to my 
spark-submit command, and even
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties
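
In other words, a submit command along these lines (the master URL, image, class 
and application jar below are placeholders, not my exact values):

spark-submit \
  --master k8s://https://<api-server-host>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<image built with the conf dir> \
  --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties \
  --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties \
  --class <main class> \
  <application jar>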

My log4j.properties file is never seen.

How do I customize log4j.properties with Kubernetes?

Thanks, Dave Jaffe



Re: Running stress tests on spark cluster to avoid wild-goose chase later

2016-11-15 Thread Dave Jaffe
Mich-

Sparkperf from Databricks (https://github.com/databricks/spark-perf) is a good 
stress test, covering a wide range of Spark functionality but especially ML. 
I’ve tested it with Spark 1.6.0 on CDH 5.7. It may need some work for Spark 2.0.
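
If you haven't run it before, the basic flow is roughly (per the project README; 
config.py is where you point it at your cluster and pick the test suites):

git clone https://github.com/databricks/spark-perf.git
cd spark-perf
cp config/config.py.template config/config.py
# edit config/config.py: set the Spark master/home, enable the MLlib suites, adjust scale factors
bin/run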

Dave Jaffe

BigData Performance
VMware
dja...@vmware.com

From: Mich Talebzadeh 
Date: Tuesday, November 15, 2016 at 11:09 AM
To: "user @spark" 
Subject: Running stress tests on spark cluster to avoid wild-goose chase later

Hi,

This is rather a broad question.

We would like to run a set of stress tests against our Spark clusters to ensure 
that the build performs as expected before deploying the cluster.

The reasoning behind this is that users were reporting different run times for 
the same ML jobs on two identical clusters; one cluster was performing much 
worse than the other on the same workload.

This was eventually traced to a wrong BIOS setting at the hardware level and 
had nothing to do with Spark itself.

So rather than spending a good while on a wild-goose chase, we would like to 
take the Spark application through some test cycles.

We have some ideas but would appreciate other feedback.

The current version is CDH 5.2.

Thanks

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




Re: Anomalous Spark RDD persistence behavior

2016-11-08 Thread Dave Jaffe
No, I am not using serialization with either memory or disk.

Dave Jaffe
VMware
dja...@vmware.com

From: Shreya Agarwal 
Date: Monday, November 7, 2016 at 3:29 PM
To: Dave Jaffe , "user@spark.apache.org" 

Subject: RE: Anomalous Spark RDD persistence behavior

I don’t think this is correct, unless you are serializing when caching to 
memory but not when persisting to disk. Can you check?
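
Concretely, the distinction I mean is between these storage levels (just a 
sketch using the standard RDD API, not your actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("persist-check").setMaster("local[*]") // local master just for this sketch
val sc   = new SparkContext(conf)
val rdd  = sc.parallelize(1 to 1000000)

// Deserialized Java objects in memory -- what rdd.cache() / plain persist() gives you.
rdd.persist(StorageLevel.MEMORY_ONLY)

// Serialized bytes in memory -- a much smaller footprint, at some CPU cost on access.
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Deserialized in memory, with partitions that don't fit spilled to disk
// (the spilled partitions are written out in serialized form).
// rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count() // force the RDD to be computed and cached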

Also, I have seen behavior where, if I have a 100 GB in-memory cache and use 
60 GB of it to persist something (MEMORY_AND_DISK), and then try to persist 
another RDD with the MEMORY_AND_DISK option that is much larger than the 
remaining 40 GB (let’s say 1 TB), my executors start getting killed at some 
point. During this period the memory usage goes above 100 GB, and after some 
extra usage it fails. It seems like Spark is trying to cache the new RDD to 
memory and move the old one out to disk, but it is not able to move the old 
one out fast enough and crashes with an OOM. Is anyone else seeing that?





Anomalous Spark RDD persistence behavior

2016-11-07 Thread Dave Jaffe
I’ve been studying Spark RDD persistence with spark-perf 
(https://github.com/databricks/spark-perf), especially when the dataset size 
starts to exceed available memory.

I’m running Spark 1.6.0 on YARN with CDH 5.7. I have 10 NodeManager nodes, each 
with 16 vcores and 32 GB of container memory. So I’m running 39 executors with 
4 cores and 8 GB each (6 GB spark.executor.memory and 2 GB 
spark.yarn.executor.memoryOverhead). I am using the default values for 
spark.memory.fraction and spark.memory.storageFraction so I end up with 3.1 GB 
available for caching RDDs, for a total of about 121 GB.
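
Concretely, that sizing corresponds to settings along these lines (expressed 
here as spark-submit flags just for illustration; the application jar is a 
placeholder):

spark-submit \
  --master yarn \
  --num-executors 39 \
  --executor-cores 4 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  <application jar and arguments>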

I’m running a single Random Forest test, with 500 features and up to 40 million 
examples, with 1 partition per core or 156 total partitions. The code (at line 
https://github.com/databricks/spark-perf/blob/master/mllib-tests/v1p5/src/main/scala/mllib/perf/MLAlgorithmTests.scala#L653)
caches the input RDD immediately after creation. At 30M examples this fits 
into memory with all 156 partitions cached, with a total of 113.4 GB in memory, 
or 4 blocks of about 745 MB each per executor. So far so good.
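
Spelling out the arithmetic behind those numbers:

113.4 GB / 156 partitions ≈ 745 MB per partition
4 partitions per executor × 745 MB ≈ 2.9 GB, under the 3.1 GB of storage memory per executor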

At 40M examples, I expected about 3 partitions to fit in memory per executor, 
or 75% to be cached. However, I found only 3 partitions across the cluster were 
cached, or 2%, for a total size in memory of 2.9 GB. Three of the executors had 
one block of 992 MB cached, with 2.1 GB free (enough for 2 more blocks). The 
other 36 held no blocks, with 3.1 GB free (enough for 3 blocks). Why this 
dramatic falloff?

I thought this might improve if I changed the persistence to MEMORY_AND_DISK. 
Unfortunately, the executor memory was then exceeded (“Container killed by YARN 
for exceeding memory limits. 8.9 GB of 8 GB physical memory used”) and the run 
ground to a halt. Why does persisting to disk take more memory than caching to 
memory?

Is this behavior expected as dataset size exceeds available memory?

Thanks in advance,

Dave Jaffe
Big Data Performance
VMware
dja...@vmware.com