Re: ML classifier and data format for dataset with variable number of features

Xiangrui Meng Fri, 11 Jul 2014 17:52:22 -0700

You can load the dataset as an RDD of JSON objects and use a flatMap to
extract the feature vectors at the object level, so that each object
becomes its own example. Then you can filter the training examples you
want for binary classification. If you want to try multiclass, check out
DB's PR at
https://github.com/apache/spark/pull/1379
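
A rough sketch of the flatMap-then-filter idea, in plain Python so it runs
without a cluster. The JSON layout here (an "objects" list whose entries
carry "label" and "features") is only my assumption, since the actual
schema was not posted, and extract_examples / is_binary_example are
made-up helper names:

```python
import json
from itertools import chain


def extract_examples(line):
    """Turn one JSON image record into (label, features) pairs, one per object.

    Assumed (hypothetical) schema:
    {"image_id": ..., "objects": [{"label": ..., "features": [...]}, ...]}
    """
    record = json.loads(line)
    return [
        (obj["label"], [float(x) for x in obj["features"]])
        for obj in record.get("objects", [])
    ]


def is_binary_example(example, classes=(0, 1)):
    """Keep only examples whose label is one of the two target classes."""
    label, _features = example
    return label in classes


# Two toy image records: the first has two objects, the second has one.
lines = [
    '{"image_id": "a", "objects": ['
    '{"label": 1, "features": [0.1, 0.2]}, '
    '{"label": 0, "features": [0.3, 0.4]}]}',
    '{"image_id": "b", "objects": [{"label": 2, "features": [0.5, 0.6]}]}',
]

# Local stand-in for a flatMap followed by a filter: flatten the
# per-image lists into one stream of object-level examples, then
# drop the ones outside the two classes.
examples = [
    ex
    for ex in chain.from_iterable(extract_examples(line) for line in lines)
    if is_binary_example(ex)
]
```

On an RDD of JSON strings the same two functions could be passed to
flatMap and filter, and each remaining pair wrapped in a LabeledPoint
before training. Every resulting example has the same number of
attributes, which is what the MLlib classifiers expect.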


Best,
Xiangrui

On Fri, Jul 11, 2014 at 5:12 PM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I need to perform binary classification on an image dataset. Each image is a
> data point described by a JSON object. The feature set for each image is a
> set of feature vectors, each feature vector corresponding to a distinct
> object in the image. For example, if an image has 5 objects, its feature set
> will have 5 feature vectors, whereas an image that has 3 objects will have a
> feature set consisting of 3 feature vectors. So the number of feature
> vectors may be different for different images, although each feature
> vector has the same number of attributes. The classification depends on the
> features of the individual objects, so I cannot aggregate them all into a
> flat vector.
>
> I have looked through the MLlib examples and it appears that the libSVM data
> format and the LabeledPoint format that MLlib uses require all the points
> to have the same number of features, and they read in a flat feature vector.
> I would like to know if any of the MLlib supervised learning classifiers can
> be used with the JSON data format and whether they can be used to classify
> points with a different number of features as described above.
>
> thanks
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/ML-classifier-and-data-format-for-dataset-with-variable-number-of-features-tp9486.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
