General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Hi,

Is there a document/link that describes the general configuration settings to 
achieve maximum Spark performance while running on CDH5? In our environment, we 
have made a lot of changes (and are still making them) to get decent performance; 
otherwise, our 6-node dev cluster with default configurations lags behind a single 
laptop running Spark.

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM) would 
be really great. Any pointers in this regard would be really helpful.

We are running Spark 1.2.0 on CDH 5.3.0.
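
To illustrate the kind of checklist entry I mean - a sketch of spark-defaults.conf 
for a 4-CPU / 16GB node, where every value is an assumption (leaving roughly 4 GB 
per node for the OS and the HDFS/YARN daemons) rather than a tested recommendation:

  # spark-defaults.conf - illustrative values only, for a 4-CPU / 16GB node
  spark.master                        yarn-client
  # two executors per node (6 nodes x 2), 2 cores / 4g heap each
  spark.executor.instances            12
  spark.executor.cores                2
  spark.executor.memory               4g
  # in MB (Spark 1.2); YARN container size = executor heap + this overhead
  spark.yarn.executor.memoryOverhead  512
  spark.driver.memory                 2g
  spark.serializer                    org.apache.spark.serializer.KryoSerializer
  # roughly 2x the total executor cores across the 6 nodes
  spark.default.parallelism           48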

Thanks,

Manish Gupta
Specialist | Sapient Global Markets

Green Boulevard (Tower C)
3rd & 4th Floor
Plot No. B-9A, Sector 62
Noida 201 301
Uttar Pradesh, India

Tel: +91 (120) 479 5000
Fax: +91 (120) 479 5001
Email: mgupt...@sapient.com

sapientglobalmarkets.com




RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Thanks Evo. Yes, my concern is only about the infrastructure configurations. 
Basically, configuring YARN (NodeManager) + Spark is a must, and the default 
settings never work. What really happens is that we make changes as and when an 
issue is hit because of one of the numerous default configuration settings, and 
every time we have to google a lot to decide on the right values :)

Again, my issue is very specific to running Spark on YARN in a CDH5 environment.

If you know of a link that talks about optimum configuration settings for running 
Spark on YARN (CDH5), please share it.
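
For example, the kind of NodeManager/executor alignment I mean - illustrative 
values only for a 4-CPU / 16GB node (all numbers below are assumptions; the 
container maximum has to cover spark.executor.memory plus 
spark.yarn.executor.memoryOverhead):

  <!-- yarn-site.xml (NodeManager) - illustrative values for a 4-CPU / 16GB node -->
  <property>
    <!-- memory handed to YARN containers; leaves ~4 GB for the OS and other daemons -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
  </property>
  <property>
    <!-- largest single container; must fit one Spark executor (heap + overhead) -->
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>
  </property>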

Thanks,
Manish

From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 10:38 PM
To: Manish Gupta 8; user@spark.apache.org
Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance

Well, there are a number of performance tuning guidelines in dedicated sections 
of the Spark documentation - have you read and applied them?

Secondly, any performance problem within a distributed cluster environment has 
two aspects:

1. Infrastructure

2. App Algorithms

You seem to be focusing only on 1, but what you said about the performance 
difference between a single laptop and the cluster points to potential algorithmic 
inefficiency in your app when, e.g., distributing data and performing parallel 
processing. On a single laptop, data moves instantly between workers because all 
worker instances run in the memory of a single machine.

Regards,
Evo Eftimov

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Thursday, April 16, 2015 6:03 PM
To: user@spark.apache.org
Subject: General configurations on CDH5 to achieve maximum Spark Performance

Hi,

Is there a document/link that describes the general configuration settings to 
achieve maximum Spark performance while running on CDH5? In our environment, we 
have made a lot of changes (and are still making them) to get decent performance; 
otherwise, our 6-node dev cluster with default configurations lags behind a single 
laptop running Spark.

Having a standard checklist (taking a base node size of 4-CPU, 16GB RAM) would 
be really great. Any pointers in this regard would be really helpful.

We are running Spark 1.2.0 on CDH 5.3.0.

Thanks,

Manish Gupta
Specialist | Sapient Global Markets




RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
Thanks for the information Andy. I will go through the versions mentioned in 
Dependencies.scala to work out which versions are compatible.

Regards,
Manish


From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, April 07, 2015 11:04 AM
To: Manish Gupta 8; user@spark.apache.org
Subject: Re: Spark 1.2.0 with Play/Activator


Hello Manish,

you can take a look at the spark-notebook build; it's a bit tricky to get rid 
of some clashes, but at least you can refer to this build for ideas.
LSS (long story short), I have stripped out Akka from the Play deps.

ref:
https://github.com/andypetrella/spark-notebook/blob/master/build.sbt
https://github.com/andypetrella/spark-notebook/blob/master/project/Dependencies.scala
https://github.com/andypetrella/spark-notebook/blob/master/project/Shared.scala
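
A minimal sketch of the kind of exclusion involved (illustrative only - the 
clashing modules and versions differ per project, so check Dependencies.scala 
rather than copying this):

  // build.sbt - minimal sketch, not the actual spark-notebook build
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.2.0",
    // keep Play, but strip its Akka so Spark's own Akka version ends up on the classpath
    ("com.typesafe.play" %% "play" % "2.2.6")
      .excludeAll(ExclusionRule(organization = "com.typesafe.akka"))
  )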

HTH, cheers
andy

On Tue, Apr 7, 2015 at 07:26, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

We are trying to build a Play framework based web application integrated with 
Apache Spark. We are running Apache Spark 1.2.0 on CDH 5.3.0, but we are struggling 
with Akka version conflicts (errors like java.lang.NoSuchMethodError in Akka). We 
have tried Play 2.2.6 as well as Activator 1.3.2.

If anyone has successfully integrated Spark 1.2.0 with Play/Activator, please 
share the versions we should use and the Akka dependencies we should add in 
Build.sbt.

Thanks,
Manish


RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
If I try to build spark-notebook with spark.version=1.2.0-cdh5.3.0, sbt 
throws these warnings before failing to compile:

:: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found
:: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found

Any suggestions?
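
(For reference, one possibility is that the CDH-versioned artifacts need 
Cloudera's repository rather than Maven Central - an untested sketch of adding 
the resolver in sbt:)

  // untested sketch - add Cloudera's repository so *-cdh artifacts can resolve
  resolvers += "cloudera-repos" at "https://repository.cloudera.com/artifactory/cloudera-repos/"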

Thanks

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Tuesday, April 07, 2015 12:04 PM
To: andy petrella; user@spark.apache.org
Subject: RE: Spark 1.2.0 with Play/Activator

Thanks for the information Andy. I will go through the versions mentioned in 
Dependencies.scala to work out which versions are compatible.

Regards,
Manish


From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, April 07, 2015 11:04 AM
To: Manish Gupta 8; user@spark.apache.org
Subject: Re: Spark 1.2.0 with Play/Activator


Hello Manish,

you can take a look at the spark-notebook build; it's a bit tricky to get rid 
of some clashes, but at least you can refer to this build for ideas.
LSS (long story short), I have stripped out Akka from the Play deps.

ref:
https://github.com/andypetrella/spark-notebook/blob/master/build.sbt
https://github.com/andypetrella/spark-notebook/blob/master/project/Dependencies.scala
https://github.com/andypetrella/spark-notebook/blob/master/project/Shared.scala

HTH, cheers
andy

On Tue, Apr 7, 2015 at 07:26, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

We are trying to build a Play framework based web application integrated with 
Apache Spark. We are running Apache Spark 1.2.0 on CDH 5.3.0, but we are struggling 
with Akka version conflicts (errors like java.lang.NoSuchMethodError in Akka). We 
have tried Play 2.2.6 as well as Activator 1.3.2.

If anyone has successfully integrated Spark 1.2.0 with Play/Activator, please 
share the versions we should use and the Akka dependencies we should add in 
Build.sbt.

Thanks,
Manish


RE: Port configuration for BlockManagerId

2015-03-29 Thread Manish Gupta 8
Has anyone else faced this issue of running spark-shell (YARN client mode) in 
an environment with strict firewall rules (with only a fixed set of allowed 
incoming ports)? How can this be rectified?
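
For reference, an untested sketch of pinning the ports explicitly (the property 
names below are taken from the Spark 1.x configuration docs; whether they cover 
every randomly chosen port here is an assumption):

  # untested sketch (spark-defaults.conf or --conf flags): fix the otherwise-random
  # driver-side ports so the firewall can whitelist them
  spark.driver.port           51000
  spark.fileserver.port       51100
  spark.broadcast.port        51200
  spark.replClassServer.port  51300
  spark.blockManager.port     51400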

Thanks,
Manish

From: Manish Gupta 8
Sent: Thursday, March 26, 2015 4:09 PM
To: user@spark.apache.org
Subject: Port configuration for BlockManagerId

Hi,

I am running spark-shell and connecting to a YARN cluster with deploy mode 
client. In our environment, there are security policies that don't allow us to 
open all TCP ports.
The issue I am facing is: the Spark shell driver is using a random port for the 
BlockManagerId - BlockManagerId(driver, host-name, 52131).

Is there any configuration I can use to fix this random port behavior?

I am running Spark 1.2.0 on CDH 5.3.0.

Thanks,
Manish






Port configuration for BlockManagerId

2015-03-26 Thread Manish Gupta 8
Hi,

I am running spark-shell and connecting to a YARN cluster with deploy mode 
client. In our environment, there are security policies that don't allow us to 
open all TCP ports.
The issue I am facing is: the Spark shell driver is using a random port for the 
BlockManagerId - BlockManagerId(driver, host-name, 52131).

Is there any configuration I can use to fix this random port behavior?

I am running Spark 1.2.0 on CDH 5.3.0.

Thanks,
Manish






RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Hi Reza,

Behavior:

· I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 
100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 
(http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) 
with all workers running at 100% CPU. There is hardly any shuffle read/write 
happening, and after some time, “ERROR YarnClientClusterScheduler: Lost 
executor” starts showing (maybe because of the nodes running at 100% CPU).

· For thresholds of 200+ (tried up to 1000) it gave the error below; the 
value reported after "greater than 1:" was different for different thresholds 
(a possible explanation is sketched after this list):
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Oversampling should be greater than 1: 0.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
        at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
        at EntitySimilarity$.main(EntitySimilarity.scala:80)
        at EntitySimilarity.main(EntitySimilarity.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

· If I get rid of the frequently occurring attributes and keep only those 
attributes which occur in at most 2% of entities, then the job doesn't get stuck or fail.
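
A possible explanation for the 200+ failures (assuming Spark 1.2's RowMatrix 
derives its oversampling factor as roughly 10 * ln(numCols) / threshold and 
requires it to exceed 1 - treat that formula as an assumption): very large 
thresholds push the factor below 1. A quick sanity check in Scala:

  // assumes gamma = 10 * ln(n) / threshold, which must be > 1 (an assumption about Spark 1.2)
  val n = 56431                                      // number of columns (Entities)
  def gamma(threshold: Double) = 10 * math.log(n) / threshold
  gamma(100.0)                                       // ~1.09 -> just above 1, accepted
  gamma(200.0)                                       // ~0.55 -> below 1, "Oversampling should be greater than 1"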

Data & environment:

· RowMatrix of size 43345 X 56431

· In the matrix there are a couple of rows whose value is the same in up to 
50% of the columns (frequently occurring attributes).

· I am running this on one of our dev clusters running CDH 5.3.0, with 5 
data nodes (each with 4 cores and 16GB RAM).

My question – Do you think this is a hardware size issue and we should test it 
on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, and 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

I am running Column Similarity (All-Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).
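
For context, a simplified sketch of the call (assuming a spark-shell style 
SparkContext named sc; the input path and parsing are placeholders, not the 
actual job):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // placeholder input: one row per Attribute, one comma-separated column per Entity
  val rows = sc.textFile("hdfs:///path/to/entity_attribute_matrix")
    .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  val mat = new RowMatrix(rows)

  // all-pairs column (Entity) similarities; the threshold trades accuracy for speed
  val exact  = mat.columnSimilarities()     // brute force
  val approx = mat.columnSimilarities(0.5)  // DIMSUM sampling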

It runs extremely fast in computing similarities between Entities in most 
cases, but if there is even a single attribute which occurs frequently across 
the entities (say in 30% of entities), the job falls apart. The whole job gets 
stuck and the worker nodes start running at 100% CPU without making any progress 
on the job stage. If the dataset is very small (in the range of 1000 Entities X 
500 attributes, some frequently occurring), the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes occurs frequently (all < 2%), then the job runs 
lightning fast (even for 100 Entities X 1 attributes) and the results are 
very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 
cores and 16GB of RAM.

My question is - Is this behavior expected for datasets where some Attributes 
frequently occur?

Thanks,
Manish Gupta





RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Thanks Reza. It makes perfect sense.

Regards,
Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, March 19, 2015 11:58 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. 
When a single row is dense, that can end up overwhelming a machine. You can push 
that up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: 
it scales linearly across the cluster with the number of rows, but still 
quadratically with the number of columns. I will be updating the documentation to 
make this clear.
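To put rough numbers on it (a back-of-the-envelope check only):

  // rough scale of the output for n = 56431 columns
  val n = 56431L
  val allEntries    = n * n            // ~3.18e9 - the "~3bn" upper bound
  val distinctPairs = n * (n - 1) / 2  // ~1.59e9 column pairs actually compared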
Best,
Reza

On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi Reza,

Behavior:

• I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 
100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 
(http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0) 
with all workers running at 100% CPU. There is hardly any shuffle read/write 
happening, and after some time, “ERROR YarnClientClusterScheduler: Lost 
executor” starts showing (maybe because of the nodes running at 100% CPU).

• For thresholds of 200+ (tried up to 1000) it gave the error below; the 
value reported after "greater than 1:" was different for different thresholds:
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Oversampling should be greater than 1: 0.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
        at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
        at EntitySimilarity$.main(EntitySimilarity.scala:80)
        at EntitySimilarity.main(EntitySimilarity.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

• If I get rid of the frequently occurring attributes and keep only those 
attributes which occur in at most 2% of entities, then the job doesn't get stuck or fail.

Data & environment:

• RowMatrix of size 43345 X 56431

• In the matrix there are a couple of rows whose value is the same in up to 
50% of the columns (frequently occurring attributes).

• I am running this on one of our dev clusters running CDH 5.3.0, with 5 
data nodes (each with 4 cores and 16GB RAM).

My question – Do you think this is a hardware size issue and we should test it 
on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, and 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

I am running Column Similarity (All-Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).

It runs extremely fast in computing similarities between Entities in most 
cases, but if there is even a single attribute which occurs frequently across 
the entities (say in 30% of entities), the job falls apart. The whole job gets 
stuck and the worker nodes start running at 100% CPU without making any progress 
on the job stage. If the dataset is very small (in the range of 1000 Entities X 
500 attributes, some frequently occurring), the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes occurs frequently

Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi,

I am running Column Similarity (All-Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).

It runs extremely fast in computing similarities between Entities in most 
cases, but if there is even a single attribute which occurs frequently across 
the entities (say in 30% of entities), the job falls apart. The whole job gets 
stuck and the worker nodes start running at 100% CPU without making any progress 
on the job stage. If the dataset is very small (in the range of 1000 Entities X 
500 attributes, some frequently occurring), the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes occurs frequently (all < 2%), then the job runs 
lightning fast (even for 100 Entities X 1 attributes) and the results are 
very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 
cores and 16GB of RAM.

My question is - Is this behavior expected for datasets where some Attributes 
frequently occur?

Thanks,
Manish Gupta




RE: Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the 
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold 
values? Try threshold values of 0.1, 0.5, 1, and 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 
mgupt...@sapient.com wrote:
Hi,

I am running Column Similarity (All-Pairs Similarity using DIMSUM) in Spark on 
a dataset that looks like (Entity, Attribute, Value), after transforming it to a 
row-oriented dense matrix format (one row per Attribute, one column per Entity, 
each cell holding a normalized value between 0 and 1).

It runs extremely fast in computing similarities between Entities in most 
cases, but if there is even a single attribute which occurs frequently across 
the entities (say in 30% of entities), the job falls apart. The whole job gets 
stuck and the worker nodes start running at 100% CPU without making any progress 
on the job stage. If the dataset is very small (in the range of 1000 Entities X 
500 attributes, some frequently occurring), the job finishes but takes too long 
(sometimes it gives GC errors too).

If none of the attributes occurs frequently (all < 2%), then the job runs 
lightning fast (even for 100 Entities X 1 attributes) and the results are 
very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 
cores and 16GB of RAM.

My question is - Is this behavior expected for datasets where some Attributes 
frequently occur?

Thanks,
Manish Gupta