Hi all,

I've recently been playing with the ml API in Spark 1.6.0, as I'm in the
process of implementing a series of new classifiers for my PhD thesis. Some
questions have arisen regarding the scalability of the different data
pipelines that can be used to load training datasets into the ML algorithms.

I used to load data into Spark by creating a raw RDD from a text file
(CSV data points) and parsing the lines directly into LabeledPoints to be
used with the learning algorithms.
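
To be concrete, the old pipeline is roughly the following (Spark 1.6 Scala,
run from spark-shell where sc is predefined; the file name and the
label-first column layout are just placeholders, not my real data):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parse each CSV line into a LabeledPoint: first field is the label,
// the remaining fields are the (dense) feature values.
val points = sc.textFile("data.csv").map { line =>
  val fields = line.split(',')
  LabeledPoint(fields.head.toDouble, Vectors.dense(fields.tail.map(_.toDouble)))
}.cache()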

Right now, I see that the spark ml API is built on top of DataFrames, so
I've changed my scripts to load the data into this structure using the
Databricks CSV connector. However, this method seems to be 2 to 4 times
slower than the original one. I can understand that creating a DataFrame
can be costly for the system, even when a pre-specified schema is used. In
addition, working with the "Row" class also seems slower than accessing the
data inside LabeledPoints.
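
For reference, the DataFrame version looks roughly like this (spark-csv
package, again from spark-shell where sqlContext is predefined; the number
of features and the column names are placeholders for my actual setup):

import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Pre-specified schema: one Double label column plus numFeatures Double columns.
val numFeatures = 10 // placeholder for the real dimensionality
val schema = StructType(
  StructField("label", DoubleType, nullable = false) +:
  (0 until numFeatures).map(i => StructField(s"f$i", DoubleType, nullable = false))
)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(schema)
  .load("data.csv")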

My last point is that all the proposed algorithms seem to work with
DataFrames whose rows are of the form <label: Double, features: Vector>,
which implies an additional parsing pass over the data when loading from a
text file.
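
By "additional parsing" I mean the step that turns the raw CSV columns into
the (label, features) layout. In my scripts that is a VectorAssembler pass
over the DataFrame df loaded above (same placeholder column names as in the
previous sketch):

import org.apache.spark.ml.feature.VectorAssembler

// Collapse the individual feature columns into a single Vector column,
// then keep only what the ml estimators need.
val assembler = new VectorAssembler()
  .setInputCols((0 until numFeatures).map(i => s"f$i").toArray)
  .setOutputCol("features")

val training = assembler.transform(df).select("label", "features")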

I'm sorry if I'm just mixing up some concepts from the documentation, but
after extensive experimentation I don't really see a clear strategy for
using these different elements. Any thoughts would be really appreciated :)

Cheers,

jarias


