Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
If you use Scala, you can do:

  val conf = new SparkConf()
    .setMaster("yarn-client")
    .setAppName("Logistic regression SGD fixed")
    .set("spark.akka.frameSize", "100")
    .setExecutorEnv("SPARK_JAVA_OPTS", "-Dspark.akka.frameSize=100")
  val sc = new SparkContext(conf)
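
If you would rather not hard-code it, the same key should also work
from conf/spark-defaults.conf (read by spark-submit in Spark 1.0+); a
minimal sketch, assuming the default conf directory:

  spark.akka.frameSize    100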


I have been struggling with this too. I was trying to run Spark on the
KDDB dataset (from the LIBSVM website), which has about 29M features.
It implodes and dies. Let me know if you are able to figure out how to
get things to work well on really, really wide datasets.
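
(A back-of-the-envelope guess at why the defaults fall over here: if I
understand MLlib's SGD correctly, a dense weight vector travels between
the driver and the executors every iteration, and 29M doubles is
roughly 29e6 * 8 bytes, about 230 MB, far above the default
spark.akka.frameSize of 10 MB. Even the 100 MB frame in the snippet
above would still be too small at that width.)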

Regards,
Krishna

On Mon, Jul 14, 2014 at 10:18 AM, crater cq...@ucmerced.edu wrote:
 Hi Xiangrui,


 Where can I set spark.akka.frameSize?





Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
That is exactly the same error that I got. I am still having no success.

Regards,
Krishna

On Mon, Jul 14, 2014 at 11:50 AM, crater cq...@ucmerced.edu wrote:
 Hi Krishna,

 Thanks for your help. Are you able to get your 29M data running yet? I fixed
 the previous problem by setting a larger spark.akka.frameSize, but now I get
 some other errors, below. Did you get these errors before?


 14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
 Akka client disassociated
 14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0)
 14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on node8: remote
 Akka client disassociated
 14/07/14 11:32:21 WARN TaskSetManager: Lost TID 21 (task 13.0:1)
 14/07/14 11:32:23 ERROR TaskSchedulerImpl: Lost executor 6 on node3: remote
 Akka client disassociated
 14/07/14 11:32:23 WARN TaskSetManager: Lost TID 22 (task 13.0:0)
 14/07/14 11:32:25 ERROR TaskSchedulerImpl: Lost executor 0 on node4: remote
 Akka client disassociated
 14/07/14 11:32:25 WARN TaskSetManager: Lost TID 23 (task 13.0:1)
 14/07/14 11:32:26 ERROR TaskSchedulerImpl: Lost executor 5 on node1: remote
 Akka client disassociated
 14/07/14 11:32:26 WARN TaskSetManager: Lost TID 24 (task 13.0:0)
 14/07/14 11:32:28 ERROR TaskSchedulerImpl: Lost executor 7 on node6: remote
 Akka client disassociated
 14/07/14 11:32:28 WARN TaskSetManager: Lost TID 26 (task 13.0:0)
 14/07/14 11:32:28 ERROR TaskSetManager: Task 13.0:0 failed 4 times; aborting
 job
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due
 to stage failure: Task 13.0:0 failed 4 times, most recent failure: TID 26 on
 host node6 failed for unknown reason
 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)






Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
I am running Spark 1.0.1 on a 5-node YARN cluster. I have set the
driver memory to 8G and executor memory to about 12G.

Regards,
Krishna


On Mon, Jul 14, 2014 at 5:56 PM, Xiangrui Meng men...@gmail.com wrote:
 Is it on a standalone server? There are several settings worth checking:

 1) the number of partitions, which should match the number of cores
 2) driver memory (you can see it in the Executors tab of the Spark
 WebUI and set it with --driver-memory 10g)
 3) the version of Spark you were running
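
 For 1), a minimal sketch, assuming the data is loaded with
 MLUtils.loadLibSVMFile (path is a stand-in for wherever the file
 lives) and the 5 nodes have about 40 cores in total:

   val data = MLUtils.loadLibSVMFile(sc, path)
     .repartition(40)  // roughly one partition per core
     .cache()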

 Best,
 Xiangrui

 On Mon, Jul 14, 2014 at 12:14 PM, Srikrishna S srikrishna...@gmail.com wrote:
 That is exactly the same error that I got. I am still having no success.

 Regards,
 Krishna



Akka Client disconnected

2014-07-12 Thread Srikrishna S
I am running logistic regression with SGD on a problem with about 19M
parameters (the kdda dataset from the LIBSVM site).

I consistently see that the nodes in my cluster get disconnected and
soon the whole job comes to a grinding halt.

14/07/12 03:05:16 ERROR cluster.YarnClientClusterScheduler: Lost
executor 2 on pachy4 remote Akka client disassociated

Does this have anything to do with spark.akka.frameSize? I have tried
up to 1024 MB and I still get the same thing.

I don't have any more information in the logs about why the clients
are getting disconnected. Any thoughts?
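
My next step is to dig through the YARN container logs, since
disassociated presumably just means the executor JVM died and the real
error never reaches the driver. On Hadoop 2.x something like this
should pull them once the application finishes:

yarn logs -applicationId <application id>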

Regards,
Krishna


Re: Akka Client disconnected

2014-07-12 Thread Srikrishna S
I am using the master that I compiled 2 days ago. Can you point me to the JIRA?

On Sat, Jul 12, 2014 at 9:13 AM, DB Tsai dbt...@dbtsai.com wrote:
 Are you using 1.0 or current master? A bug related to this is fixed in
 master.



Job getting killed

2014-07-11 Thread Srikrishna S
I am trying to run Logistic Regression on the url dataset (from
LIBSVM), using the exact same code as the example, on a 5-node YARN
cluster.

I get a pretty cryptic error that says only:

Killed

Nothing more.

Settings:

  --master yarn-client
  --verbose
  --driver-memory 24G
  --executor-memory 24G
  --executor-cores 8
  --num-executors 5

I set spark.akka.frameSize to 200 MB.
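
My current theory: a bare Killed with no stack trace means the process
was killed from outside the JVM, either by the kernel OOM killer or by
the YARN NodeManager enforcing its container limit. If I remember the
1.0-era YARN integration correctly, Spark only adds a few hundred MB of
headroom on top of --executor-memory when sizing the container request,
so 24G executors can easily overshoot; I will try a smaller
--executor-memory next.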


Script:

def main(args: Array[String]) {

  val conf = new SparkConf()
    .setMaster("yarn-client")
    .setAppName("Logistic regression SGD fixed")
    .set("spark.akka.frameSize", "200")
  val sc = new SparkContext(conf)

  // Load and parse the data
  val dataset = args(0)
  val maxIterations = 100
  val start_time = System.nanoTime()
  val data = MLUtils.loadLibSVMFile(sc, dataset)

  // Build the model
  val solver = new LogisticRegressionWithSGD()
  solver.optimizer.setNumIterations(maxIterations)
  solver.optimizer.setRegParam(0.01)
  val model = solver.run(data)

  // Measure the accuracy. Don't measure the time taken to do this.
  val predictionsAndLabels = data.map { point =>
    val prediction = model.predict(point.features)
    (prediction, point.label)
  }

  val accuracy = predictionsAndLabels.filter(r => r._1 == r._2).count.toDouble / data.count
  val elapsed_time = (System.nanoTime() - start_time) / 1e9

  // Use the last known accuracy
  println(dataset + ",spark-sgd," + maxIterations + "," + elapsed_time + "," + accuracy)
  System.exit(0)
}
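
Side note on the script above: data is never cached, so each SGD
iteration (and the final accuracy pass) presumably rereads and reparses
the libsvm file. If the dataset fits in memory, a one-line change
should help:

  val data = MLUtils.loadLibSVMFile(sc, dataset).cache()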


Re: Spark Installation

2014-07-08 Thread Srikrishna S
Hi All,


I tried the make-distribution script and it worked well. I was able to
compile the Spark binary on our CDH5 cluster. Once I compiled Spark, I
copied the binaries in the dist folder over to all the other machines
in the cluster.

However, I ran into an issue while submitting a job in yarn-client mode.
I get an error message that says the following:
Resource 
file:/opt/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar
changed on src filesystem (expected 1404845211000, was 1404845404000)
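
(From what I can tell, the check comes from the YARN distributed cache:
the NodeManager compares the modification time of a localized resource
against the timestamp recorded when the job registered it. Copying the
dist folder around presumably left assembly jars with different mtimes;
making sure every node, and HDFS if the jar was uploaded there, has the
exact same assembly file, or clearing the stale copy, should get past
this.)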

My end goal is to submit a job (that uses MLLib) in our Yarn cluster.

Any thoughts anyone?

Regards,
Krishna



On Tue, Jul 8, 2014 at 9:49 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

 Hi Srikrishna,

 The binaries are built with something like
 mvn package -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1

 -Sandy


 On Tue, Jul 8, 2014 at 3:14 AM, 田毅 tia...@asiainfo.com wrote:

 try this command:

 make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive




 田毅 (Tian Yi)
 ===
 Juyun ("Orange Cloud") Platform Product Line
 Big Data Products Department
 AsiaInfo-Linkage Technologies (China), Inc.
 Mobile: 13910177261
 Tel: 010-82166322
 Fax: 010-82166617
 QQ: 20057509
 MSN: yi.t...@hotmail.com
 Address: AsiaInfo-Linkage Tower, East Zone, No. 10 Dongbeiwang West Road, Haidian District, Beijing


 ===

 On Jul 8, 2014, at 11:53 AM, Krishna Sankar ksanka...@gmail.com wrote:

 Couldn't find any reference to CDH in pom.xml (profiles or
 hadoop.version). I am also wondering how the CDH-compatible artifact
 was compiled.
 Cheers
 k/


 On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S srikrishna...@gmail.com wrote:

 Hi All,

 Does anyone know what the command-line arguments to mvn are to generate the
 pre-built binary for Spark on Hadoop 2 / CDH5?

 I would like to pull in a recent bug fix from spark-master and rebuild the
 binaries in the exact same way as the ones provided on the
 website.

 I have tried the following:

 mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1

 And it doesn't quite work.

 Any thoughts anyone?






Spark Installation

2014-07-07 Thread Srikrishna S
Hi All,

Does anyone know what the command-line arguments to mvn are to generate the
pre-built binary for Spark on Hadoop 2 / CDH5?

I would like to pull in a recent bug fix from spark-master and rebuild the
binaries in the exact same way as the ones provided on the
website.

I have tried the following:

mvn install -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1

And it doesn't quite work.

Any thoughts anyone?


Logistic Regression MLLib Slow

2014-06-04 Thread Srikrishna S
Hi All,

I am new to Spark and I am trying to run LogisticRegression (with SGD)
using MLLib on a beefy single machine with about 128 GB of RAM. The dataset
has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.

I am running the code using all 8 cores with 20G of memory:

spark-submit --executor-memory 20G --master local[8] logistic_regression.py

It seems to take about 3.5 hours without caching and over 5 hours with
caching.

What is the recommended use for Spark on a beefy single machine?

Any suggestions will help!

Regards,
Krishna


Code sample:
-
import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="LogisticRegressionSGD")  # app name is arbitrary

# Dataset
d = sys.argv[1]
data = sc.textFile(d)

# Load and parse the data
# --------------------------------------------------------------
def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

_parsedData = data.map(parsePoint)
parsedData = _parsedData.cache()
results = {}

# Spark
# --------------------------------------------------------------
start_time = time.time()
# Build the model
niters = 10
spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)

# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())


Re: Logistic Regression MLLib Slow

2014-06-04 Thread Srikrishna S
I will try both and get back to you soon!

Thanks for all your help!

Regards,
Krishna


On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Krishna,

 Specifying executor memory in local mode has no effect, because all of
 the threads run inside the same JVM. You can either try
 --driver-memory 60g or start a standalone server.
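
 A minimal sketch of the standalone route on one machine (paths assume
 a stock Spark layout, and worker memory comes from SPARK_WORKER_MEMORY
 in conf/spark-env.sh):

   ./sbin/start-master.sh
   ./sbin/start-slaves.sh   # with localhost listed in conf/slaves
   ./bin/spark-submit --master spark://$(hostname):7077 \
     --executor-memory 60G logistic_regression.py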

 Best,
 Xiangrui

 On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng men...@gmail.com wrote:
  80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
  take that long, even on a single executor. Besides what Matei
  suggested, could you also verify the executor memory in
  http://localhost:4040 in the Executors tab. It is very likely the
  executors do not have enough memory. In that case, caching may be
  slower than reading directly from disk. -Xiangrui
 
  On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
  Ah, is the file gzipped by any chance? We can’t decompress gzipped
  files in parallel so they get processed by a single task.
 
  It may also be worth looking at the application UI (http://localhost:4040)
  to see 1) whether all the data fits in memory in the Storage tab (maybe it
  somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
  and 2) how many parallel tasks run in each iteration.
 
  Matei
 
  On Jun 4, 2014, at 6:56 PM, Srikrishna S srikrishna...@gmail.com wrote:
 
  I am using the MLLib one (LogisticRegressionWithSGD) with PySpark. I am
  running only 10 iterations.
 
  The MLLib version of logistic regression doesn't seem to use all the
  cores on my machine.
 
  Regards,
  Krishna
 
 
 
  On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
  Are you using the logistic_regression.py in examples/src/main/python or
  examples/src/main/python/mllib? The first one is an example of writing
  logistic regression by hand and won’t be as efficient as the MLlib one.
  I suggest trying the MLlib one.
 
  You may also want to check how many iterations it runs — by default I
  think it runs 100, which may be more than you need.
 
  Matei
 