RE: General configurations on CDH5 to achieve maximum Spark Performance
Thanks Evo. Yes, my concern is only regarding the infrastructure configurations. Basically, configuring YARN (NodeManager) + Spark is a must, and the default settings never work. What really happens is that we make changes as and when an issue is hit because of one of the numerous default configuration settings, and every time we have to google a lot to decide on the right values :) Again, my issue is very specific to running Spark on YARN in a CDH5 environment. If you know of a link that discusses optimum configuration settings for running Spark on YARN (CDH5), please share it. Thanks, Manish

From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Well, there are a number of performance tuning guidelines in dedicated sections of the Spark documentation - have you read and applied them? Secondly, any performance problem within a distributed cluster environment has two aspects: 1. Infrastructure, 2. App algorithms. You seem to be focusing only on 1, but what you said about the performance difference between a single laptop and a cluster points to potential algorithmic inefficiency in your app when, e.g., distributing data and performing parallel processing. On a single laptop, data moves instantly between workers because all worker instances run in the memory of a single machine. Regards, Evo Eftimov
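[Editor's note] For reference, one possible starting point for the NodeManager side that the reply above mentions. The values below are assumptions for a 4-core/16GB node (leaving headroom for the OS and HDFS daemons), not CDH-blessed defaults; tune them for your workload:

  <!-- yarn-site.xml (per NodeManager); illustrative values only -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value> <!-- ~12 GB of the 16 GB handed to YARN containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>12288</value> <!-- lets a single container use all of a node's YARN memory -->
  </property>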
General configurations on CDH5 to achieve maximum Spark Performance
Hi, Is there a document/link that describes the general configuration settings for achieving maximum Spark performance while running on CDH5? In our environment we have made a lot of changes (and are still making them) to get decent performance; otherwise our 6-node dev cluster with default configurations lags behind a single laptop running Spark. Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM) would be really great. Any pointers in this regard would be really helpful. We are running Spark 1.2.0 on CDH 5.3.0. Thanks, Manish Gupta
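[Editor's note] For what it's worth, a sketch of the Spark side of such a checklist for 4-core/16GB nodes. The sizing is an assumption to be tuned against the YARN container limits above, and the class/jar names are placeholders, not an authoritative recipe:

  # Sketch for Spark 1.2 on YARN: e.g. one executor per worker node,
  # leaving a core and some memory for OS/HDFS daemons.
  # 8g + 1g overhead must fit under yarn.scheduler.maximum-allocation-mb.
  spark-submit \
    --master yarn-client \
    --num-executors 6 \
    --executor-cores 3 \
    --executor-memory 8g \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --class com.example.MyApp myapp.jar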
RE: Spark 1.2.0 with Play/Activator
If I try to build spark-notebook with "spark.version" = "1.2.0-cdh5.3.0", sbt throws these warnings before failing to compile:

:: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found
:: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found

Any suggestions? Thanks
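[Editor's note] CDH-tagged artifacts are not on Maven Central; they normally resolve from Cloudera's repository, so a resolver along these lines (a sketch, assuming the build doesn't already include it) is the first thing to check. Note that not every Spark module (e.g. spark-yarn) is necessarily published there for every release:

  // build.sbt - add Cloudera's repo so *-cdh artifacts can resolve
  resolvers += "cloudera-repos" at "https://repository.cloudera.com/artifactory/cloudera-repos/"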
RE: Spark 1.2.0 with Play/Activator
Thanks for the information Andy. I will go through the versions mentioned in Dependencies.scala to identify the compatibility. Regards, Manish

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, April 07, 2015 11:04 AM
To: Manish Gupta 8; user@spark.apache.org
Subject: Re: Spark 1.2.0 with Play/Activator

Hello Manish, you can take a look at the spark-notebook build; it's a bit tricky to get rid of some clashes, but at least you can refer to this build to get ideas. Long story short, I have stripped akka out of the Play deps. Ref:
https://github.com/andypetrella/spark-notebook/blob/master/build.sbt
https://github.com/andypetrella/spark-notebook/blob/master/project/Dependencies.scala
https://github.com/andypetrella/spark-notebook/blob/master/project/Shared.scala
HTH, cheers, andy
Spark 1.2.0 with Play/Activator
Hi, We are trying to build a Play framework based web application integrated with Apache Spark. We are running Apache Spark 1.2.0 on CDH 5.3.0, but we are struggling with akka version conflicts (errors like java.lang.NoSuchMethodError in akka). We have tried Play 2.2.6 as well as Activator 1.3.2. If anyone has successfully integrated Spark 1.2.0 with Play/Activator, please share the versions we should use and the akka dependencies we should add in Build.sbt. Thanks, Manish
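[Editor's note] The usual shape of the fix Andy describes above (stripping akka from the Play deps so Spark's own akka wins) looks roughly like this in sbt. This is a sketch, not the exact spark-notebook build; version numbers are taken from this thread, and whether Play 2.2 actually runs against Spark's akka 2.3.x is exactly the compatibility question Dependencies.scala answers:

  // Build.sbt sketch: exclude Play's akka so it doesn't clash with Spark's
  libraryDependencies ++= Seq(
    ("com.typesafe.play" %% "play" % "2.2.6")
      .excludeAll(ExclusionRule(organization = "com.typesafe.akka")),
    "org.apache.spark" %% "spark-core" % "1.2.0"
  )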
RE: Port configuration for BlockManagerId
Has anyone else faced this issue of running spark-shell (yarn-client mode) in an environment with strict firewall rules (only a fixed set of incoming ports is allowed)? How can this be rectified? Thanks, Manish
Port configuration for BlockManagerId
Hi, I am running spark-shell connecting to a YARN cluster with deploy mode "client". In our environment, there are security policies that don't allow us to open all TCP ports. The issue I am facing is that the Spark shell driver uses a random port for the BlockManagerId - BlockManagerId(, host-name, 52131). Is there any configuration I can use to fix this random port behavior? I am running Spark 1.2.0 on CDH 5.3.0. Thanks, Manish
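[Editor's note] For anyone searching later: Spark 1.1+ exposes properties to pin the otherwise-random service ports. A sketch of the relevant conf; the port numbers below are arbitrary placeholders to match whatever the firewall allows:

  # spark-defaults.conf sketch: pin driver-side service ports for firewalled setups
  # (port values are placeholders, not recommendations)
  spark.driver.port            50001
  spark.fileserver.port        50002
  spark.broadcast.port         50003
  spark.replClassServer.port   50004
  spark.blockManager.port      50005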
RE: Column Similarity using DIMSUM
Thanks Reza. It makes perfect sense. Regards, Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, March 19, 2015 11:58 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish, With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. When a single row is dense, that can end up overwhelming a machine. You can push that up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: it scales linearly with the number of rows and across the cluster, but still quadratically with the number of columns. I will be updating the documentation to make this clear. Best, Reza
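[Editor's note] A quick sanity check of the bound Reza quotes, i.e. roughly 3.18 billion candidate pairs:

  scala> 56431L * 56431L
  res0: Long = 3184457761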
RE: Column Similarity using DIMSUM
Hi Reza,

Behavior:
- I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 (stage link: http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) with all workers running at 100% CPU. There is hardly any shuffle read/write happening, and after some time "ERROR YarnClientClusterScheduler: Lost executor" messages start showing (maybe because of the nodes running at 100% CPU).
- For thresholds of 200+ (tried up to 1000) it gave an error (the number reported was different for different thresholds):

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Oversampling should be greater than 1: 0.
 at scala.Predef$.require(Predef.scala:233)
 at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
 at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
 at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
 at EntitySimilarity$.main(EntitySimilarity.scala:80)
 at EntitySimilarity.main(EntitySimilarity.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

- If I get rid of the frequently occurring attributes and keep only attributes that occur in at most 2% of entities, the job doesn't get stuck or fail.

Data & environment:
- RowMatrix of size 43345 x 56431.
- The matrix contains a couple of rows whose value is the same in up to 50% of the columns (frequently occurring attributes).
- I am running this on one of our dev clusters running CDH 5.3.0: 5 data nodes (each 4-core, 16GB RAM).

My question - do you think this is a hardware size issue and we should test it on larger machines?

Regards, Manish
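[Editor's note] The "Oversampling should be greater than 1" failure at high thresholds is consistent with the oversampling parameter shrinking as the threshold grows. A sketch of the check, based on a reading of Spark 1.2's RowMatrix.columnSimilaritiesDIMSUM (treat the exact formula as an assumption and verify against your Spark version):

  // Sketch of the DIMSUM oversampling check as it appears to work in Spark 1.2
  val numCols = 56431
  val threshold = 200.0
  val gamma = 10 * math.log(numCols) / threshold   // ~= 0.55 here
  require(gamma > 1.0, s"Oversampling should be greater than 1: $gamma")
  // => for 56431 columns, any threshold above ~10 * ln(56431) ~= 109 trips this
  //    check, which would explain why 200+ failed while 100 did not.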
RE: Column Similarity using DIMSUM
Hi Reza, I have tried thresholds only in the range of 0 to 1. I was not aware that the threshold can be set above 1. Will try and update. Thank You - Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish, Did you try calling columnSimilarities(threshold) with different threshold values? Try threshold values of 0.1, 0.5, 1, 20, and higher. Best, Reza
Column Similarity using DIMSUM
Hi, I am running Column Similarity (All-Pairs Similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value), after transforming it to a row-oriented dense matrix format (one row per Attribute, one column per Entity, each cell holding a normalized value between 0 and 1). It computes similarities between Entities extremely fast in most cases, but if there is even a single attribute that occurs frequently across the entities (say in 30% of entities), the job falls apart: the whole job gets stuck and the worker nodes run at 100% CPU without making any progress on the job stage. If the dataset is very small (in the range of 1000 Entities X 500 attributes, some frequently occurring), the job finishes but takes too long (sometimes it gives GC errors too). If none of the attributes occur frequently (all < 2%), the job runs lightning fast (even for 100 Entities X 1 attributes) and the results are very accurate. I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores and 16GB of RAM. My question is: is this behavior expected for datasets where some attributes occur frequently? Thanks, Manish Gupta
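[Editor's note] For readers landing on this thread, a minimal, self-contained sketch of the MLlib API being discussed. The data here is a toy placeholder; in the real job the rows come from the transformed (Entity, Attribute, Value) dataset:

  // Minimal DIMSUM usage sketch (Spark 1.2 MLlib); run in spark-shell where sc exists
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // One row per attribute, one column per entity, values normalized to [0, 1]
  val rows = sc.parallelize(Seq(
    Vectors.dense(0.2, 0.9, 0.0),
    Vectors.dense(0.7, 0.1, 0.4),
    Vectors.dense(0.0, 0.5, 0.8)
  ))
  val mat = new RowMatrix(rows)

  // threshold 0.0 = exact all-pairs; larger thresholds trade accuracy for speed
  val sims = mat.columnSimilarities(0.5)
  sims.entries.collect().foreach(println)   // prints MatrixEntry(i, j, similarity)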