Hi all, I've recently been playing with the ml API in Spark 1.6.0, as I'm implementing a series of new classifiers for my PhD thesis. Some questions have arisen regarding the scalability of the different data pipelines that can be used to load training datasets into the ML algorithms.
I used to load data into Spark by creating a raw RDD from a text file (CSV data points) and parsing the lines directly into LabeledPoints to feed the learning algorithms. Now that the spark.ml API is built on top of DataFrames, I've changed my scripts to load the data into that structure using the databricks spark-csv connector. However, this method seems to be 2 to 4 times slower than the original one. I can understand that creating a DataFrame can be costly for the system, even when a pre-specified schema is supplied. In addition, working with the Row class seems slower than accessing the data inside LabeledPoints.

My last point is that all the proposed algorithms appear to work with DataFrames whose rows have the form <label: Double, features: Vector>, which implies an additional parsing pass over the data when loading from a text file.

I'm sorry if I'm just mixing up some concepts from the documentation, but after intensive experimentation I don't see a clear strategy for combining these different elements. Any thoughts would be really appreciated :)

Cheers,
jarias
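P.S. In case it helps, here is roughly what the two pipelines I'm comparing look like. This is only a minimal sketch, not my actual code: the file path, the column names, and the assumption that the label is the last (numeric) CSV column are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.VectorAssembler

object LoadingComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("loading-comparison"))
    val sqlContext = new SQLContext(sc)

    // Pipeline 1: raw RDD[String] -> RDD[LabeledPoint] (the old mllib-style path).
    // Assumes all fields are numeric and the label is the last column.
    val labeledPoints = sc.textFile("data/train.csv").map { line =>
      val fields = line.split(',').map(_.toDouble)
      LabeledPoint(fields.last, Vectors.dense(fields.init))
    }

    // Pipeline 2: DataFrame via the spark-csv connector, then assemble the
    // feature columns into a single vector column, as the spark.ml API expects.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "true") // or supply a pre-built schema with .schema(...)
      .load("data/train.csv")

    val featureCols = df.columns.init // every column except the last one
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
    val mlReady = assembler.transform(df)
      .withColumnRenamed(df.columns.last, "label")
      .select("label", "features")

    mlReady.show(5)
    sc.stop()
  }
}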