Using different Spark jars than the ones on the cluster

2015-03-18 Thread jaykatukuri
Hi all,
I am trying to run a job that needs spark-sql_2.11-1.3.0.jar. 
The cluster that I am running on is still on Spark 1.2.0.

I tried the following:

spark-submit --class class-name --num-executors 100 --master yarn \
  application_jar --jars hdfs:///path/spark-sql_2.11-1.3.0.jar hdfs:///input_data

But this did not work; I get an error that it cannot find a class/method that is
in spark-sql_2.11-1.3.0.jar:

org.apache.spark.sql.SQLContext.implicits()Lorg/apache/spark/sql/SQLContext$implicits$

The general question is: how do we use a different version of the Spark jars
(spark-core, spark-sql, spark-ml, etc.) than the ones running on the cluster?
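
For context: a NoSuchMethodError like the one above usually means the application
was compiled against a different Spark version than the one on the cluster's
runtime classpath. A minimal build.sbt sketch of the conventional setup, assuming
sbt as the build tool; the Scala version, artifact list, and version numbers below
are illustrative assumptions, not taken from this post:

// Hedged sketch: compile against the cluster's Spark version and mark it "provided",
// so that spark-submit supplies the matching jars at runtime.
scalaVersion := "2.10.4"   // assumed: Spark 1.2.0 prebuilt distributions target Scala 2.10

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",   // match the cluster version
  "org.apache.spark" %% "spark-sql"  % "1.2.0" % "provided"
)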

Thanks,
Jay








RDD to DataFrame for using ALS under org.apache.spark.ml.recommendation.ALS

2015-03-16 Thread jaykatukuri
Hi all,
I am trying to use the new ALS implementation under
org.apache.spark.ml.recommendation.ALS.



The new method to invoke for training seems to be:

override def fit(dataset: DataFrame, paramMap: ParamMap): ALSModel

How do I create a DataFrame from a ratings data set that is on HDFS?


whereas the method in the old ALS implementation under
org.apache.spark.mllib.recommendation.ALS was:

def train(
    ratings: RDD[Rating],
    rank: Int,
    iterations: Int,
    lambda: Double,
    blocks: Int,
    seed: Long
): MatrixFactorizationModel

My code to run the old ALS train method is below:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext(conf)

val pfile = args(0)
val purchase = sc.textFile(pfile)
// parse user,item,rating triples from the comma-separated input
val ratings = purchase.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toInt)
})

val model = ALS.train(ratings, rank, numIterations, 0.01)


Now, for the new ALS fit method, I am trying to run the code below, but I get a
compilation error:

val als = new ALS()
  .setRank(rank)
  .setRegParam(regParam)
  .setImplicitPrefs(implicitPrefs)
  .setNumUserBlocks(numUserBlocks)
  .setNumItemBlocks(numItemBlocks)

val sc = new SparkContext(conf)

val pfile = args(0)
val purchase = sc.textFile(pfile)
val ratings = purchase.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toInt)
})

val model = als.fit(ratings.toDF())

I get an error that the method toDF() is not a member of
org.apache.spark.rdd.RDD[org.apache.spark.ml.recommendation.ALS.Rating[Int]].
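
For illustration, a minimal sketch of one common way this particular compile error
goes away, assuming Spark 1.3's SQLContext: toDF() only becomes available on an RDD
of case classes after importing the SQLContext implicits. rank and regParam are
carried over from the snippet above; everything else is an assumption, not a
confirmed fix:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating

val sc = new SparkContext(new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // brings toDF() into scope for RDDs of case classes

val purchase = sc.textFile(args(0))
val ratings = purchase.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toFloat)
})

val als = new ALS()
  .setRank(rank)
  .setRegParam(regParam)

// case-class fields become columns user/item/rating, matching the ALS defaults
val model = als.fit(ratings.toDF())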

Appreciate the help!

Thanks,
Jay









MLlib / ALS: java.lang.OutOfMemoryError: Java heap space

2014-12-08 Thread jaykatukuri
Hi all,

I am running into an out-of-memory error while running ALS from MLlib on a
reasonably small data set of around 6 million ratings.

The stack trace is below:

java.lang.OutOfMemoryError: Java heap space
    at org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
    at org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)
    at org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$21.apply(ALS.scala:465)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$21.apply(ALS.scala:465)
    at scala.Array$.fill(Array.scala:267)
    at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:465)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:445)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:444)
    at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
    at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

I am using 2 GB of executor memory, and I tried with 100 executors.

Can someone please point me in the right direction?
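
For illustration: the allocation that fails in the trace above is the per-block
factor matrix fill inside updateBlock, so the usual levers are executor memory, the
rank, and the number of ALS blocks. A minimal sketch using the mllib ALS.train
overload that takes an explicit block count; every parameter value below is an
assumption for illustration, not taken from the original run:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val rank = 10            // assumed; each per-block factor matrix grows with the rank
val iterations = 10      // assumed
val lambda = 0.01        // assumed
val numBlocks = 200      // assumed; more blocks mean smaller per-task arrays in updateBlock

// ratings: RDD[Rating], parsed as in the earlier posts
val model = ALS.train(ratings, rank, iterations, lambda, numBlocks)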

Thanks,
Jay




