If you are using DataFrames, you can also specify the schema when loading the data, as an alternative solution. I've found Spark-CSV <https://github.com/databricks/spark-csv> to be a very useful library when working with CSV data.
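For illustration, here is a minimal sketch of loading a CSV file with an explicit schema through Spark-CSV. It assumes the Spark 1.x SQLContext API and that the spark-csv package is on the classpath. The file name ratings.csv, the header row, and the column types (Name as a string, Value1 as an integer, Value2 as a double) are assumptions for the example and not details from the original question, so adjust them to your data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object LoadCsvWithSchema {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LoadCsvWithSchema").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Assumed column types: Name stays a String, so no NumberFormatException
    // is raised while reading the first column.
    val schema = StructType(Seq(
      StructField("Name", StringType, nullable = true),
      StructField("Value1", IntegerType, nullable = true),
      StructField("Value2", DoubleType, nullable = true)
    ))

    // "ratings.csv" is a hypothetical path; "header" -> "true" assumes the
    // first line of the file holds the column names.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(schema)
      .load("ratings.csv")

    df.printSchema()
    df.show()

    sc.stop()
  }
}
```

Note that this only gets the data loaded with the right types. The Name column is still a String, so it would still need to be mapped to integer ids before being passed to the Rating class that takes Int ids.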
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

On Mon, Mar 7, 2016 at 1:10 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> As you've pointed out, Rating requires user and item ids in Int form. So
> you will need to map String user ids to integers.
>
> See this thread for example:
> https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJgQjQ9GhGqpg1=hvxpfrs+59elfj9f7knhp8nyqnh1ut_6...@mail.gmail.com%3E
>
> There is a DeveloperApi method
> in org.apache.spark.ml.recommendation.ALS that takes Rating with generic
> type (can be String) for user id and item id. However, that is a little more
> involved, and for larger scale data will be a lot less efficient.
>
> Something like this for example:
>
> // SparkConf and SparkContext are needed for the setup below
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.ml.recommendation.ALS
> import org.apache.spark.ml.recommendation.ALS.Rating
>
> val conf = new SparkConf().setAppName("ALSWithStringID").setMaster("local[4]")
> val sc = new SparkContext(conf)
> // Name,Value1,Value2.
> val rdd = sc.parallelize(Seq(
>   Rating[String]("foo", "1", 4.0f),
>   Rating[String]("foo", "2", 2.0f),
>   Rating[String]("bar", "1", 5.0f),
>   Rating[String]("bar", "3", 1.0f)
> ))
> val (userFactors, itemFactors) = ALS.train(rdd)
>
> As you can see, you just get the factor RDDs back, and if you want an
> ALSModel you will have to construct it yourself.
>
> On Sun, 6 Mar 2016 at 18:23 Shishir Anshuman <shishiranshu...@gmail.com>
> wrote:
>
>> I am new to Apache Spark, and I want to implement the Alternating Least
>> Squares algorithm. The data set is stored in a CSV file in the format:
>> *Name,Value1,Value2*.
>>
>> When I read the CSV file, I get a
>> *java.lang.NumberFormatException.forInputString* error because the
>> Rating class needs the parameters in the format: *(user: Int, product:
>> Int, rating: Double)* and the first column of my file contains *Name*.
>>
>> Please suggest a way to overcome this issue.
>>