General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Hi, Is there a document/link that describes the general configuration settings to achieve maximum Spark performance while running on CDH5? In our environment, we have made a lot of changes (and are still making them) to get decent performance; otherwise our 6-node dev cluster with default configurations lags
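There is no single answer in the thread, but the usual starting point is to move off the defaults via SparkConf (or spark-defaults.conf). A minimal sketch, with purely illustrative values -- the right numbers depend on the cluster and workload, and none of these are CDH5 defaults:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative tuning sketch only; values must be sized to your cluster.
val conf = new SparkConf()
  .setAppName("TunedApp")
  // Kryo is usually much faster than default Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Size executors to fit the YARN containers available on each node.
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")
  // Raise shuffle parallelism; the default is often too low on small clusters.
  .set("spark.default.parallelism", "48")

val sc = new SparkContext(conf)
```

The same keys can be set cluster-wide in spark-defaults.conf rather than per application.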

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Thanks, Manish From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 10:38 PM To: Manish Gupta 8; user@spark.apache.org Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance Well there are a number of performance tuning guidelines in dedicated

RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
Thanks for the information Andy. I will go through the versions mentioned in Dependencies.scala to identify the compatibility. Regards, Manish From: andy petrella [mailto:andy.petre...@gmail.com] Sent: Tuesday, April 07, 2015 11:04 AM To: Manish Gupta 8; user@spark.apache.org Subject: Re

RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
If I try to build spark-notebook with spark.version=1.2.0-cdh5.3.0, sbt throws these warnings before failing to compile: :: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found :: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found Any suggestions? Thanks From: Manish Gupta 8
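The "not found" errors typically mean the CDH-versioned artifacts are being looked up on Maven Central, which only hosts the stock Apache versions. A build.sbt sketch of the usual fix -- adding Cloudera's repository as a resolver (this assumes the cdh5.3.0 artifacts are published there for these modules):

```scala
// build.sbt -- sketch; CDH-suffixed versions are not on Maven Central.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0-cdh5.3.0" % "provided",
  "org.apache.spark" %% "spark-yarn" % "1.2.0-cdh5.3.0" % "provided"
)
```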

RE: Port configuration for BlockManagerId

2015-03-29 Thread Manish Gupta 8
Has anyone else faced this issue of running spark-shell (yarn client mode) in an environment with strict firewall rules (only fixed incoming ports allowed)? How can this be rectified? Thanks, Manish From: Manish Gupta 8 Sent: Thursday, March 26, 2015 4:09 PM To: user@spark.apache.org Subject

Port configuration for BlockManagerId

2015-03-26 Thread Manish Gupta 8
Hi, I am running spark-shell and connecting to a YARN cluster with deploy mode as client. In our environment, there are security policies that don't allow us to open all TCP ports. The issue I am facing is: the Spark shell driver is using a random port for BlockManagerId -
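In Spark 1.x these ports default to 0 (random) but can be pinned to fixed values that the firewall can then whitelist. A sketch via SparkConf -- the port numbers below are arbitrary examples, not recommendations:

```scala
import org.apache.spark.SparkConf

// Pin the normally-random driver-side ports so firewall rules can allow them.
val conf = new SparkConf()
  .set("spark.driver.port",       "40000") // driver RPC endpoint
  .set("spark.blockManager.port", "40001") // BlockManager (driver and executors)
  .set("spark.fileserver.port",   "40002") // driver HTTP file server
  .set("spark.broadcast.port",    "40003") // HTTP broadcast server
```

For spark-shell, the same properties can be passed with repeated `--conf key=value` flags instead of code.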

RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
this is a hardware size issue and we should test it on larger machines? Regards, Manish From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Wednesday, March 18, 2015 11:20 PM To: Reza Zadeh Cc: user@spark.apache.org Subject: RE: Column Similarity using DIMSUM Hi Reza, I have tried

RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Thanks Reza. It makes perfect sense. Regards, Manish From: Reza Zadeh [mailto:r...@databricks.com] Sent: Thursday, March 19, 2015 11:58 PM To: Manish Gupta 8 Cc: user@spark.apache.org Subject: Re: Column Similarity using DIMSUM Hi Manish, With 56431 columns, the output can be as large as 56431

Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi, I am running Column Similarity (all-pairs similarity using DIMSUM) in Spark on a dataset that looks like (Entity, Attribute, Value), after transforming it to a row-oriented dense matrix format (one line per Attribute, one column per Entity, each cell holding a normalized value – between 0
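For context, the computation being discussed is MLlib's `RowMatrix.columnSimilarities`. A minimal sketch with a tiny hand-made matrix (the data here is purely illustrative; `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One row per Attribute, one column per Entity, normalized cells.
val rows = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.9, 0.4),
  Vectors.dense(0.8, 0.2, 0.5)
))
val mat = new RowMatrix(rows)

// Exact all-pairs similarity (expensive when there are many columns) ...
val exact = mat.columnSimilarities()
// ... or DIMSUM sampling: a higher threshold prunes low-similarity pairs,
// trading accuracy for speed.
val approx = mat.columnSimilarities(threshold = 0.5)
```

With tens of thousands of columns (56431 in this thread), the exact variant can emit up to n*(n-1)/2 pairs, which is why the threshold matters.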

RE: Column Similarity using DIMSUM

2015-03-18 Thread Manish Gupta 8
Hi Reza, I have only tried thresholds in the range 0 to 1. I was not aware that the threshold can be set above 1. Will try and update. Thank You - Manish From: Reza Zadeh [mailto:r...@databricks.com] Sent: Wednesday, March 18, 2015 10:55 PM To: Manish Gupta 8 Cc: user@spark.apache.org