You will get a 10x speedup in your MLlib KMeans run by using the Breeze
sparse vector from MLlib instead of the Mahout vector.
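
A minimal sketch of that conversion step, under stated assumptions rather than a definitive implementation: the helper name `toSparseArrays` is made up for illustration, and the spark-shell wiring in the trailing comments (`toMLlib`, `k`, `maxIterations`, and the `data` RDD of Mahout vectors) is hypothetical and untested here. The runnable part only builds the sorted, zero-free index and value arrays that `Vectors.sparse(size, indices, values)` takes.

```scala
// Pure-Scala helper (hypothetical name): turn (index, value) pairs into
// index and value arrays with explicit zeros dropped and indices ascending.
def toSparseArrays(entries: Iterable[(Int, Double)]): (Array[Int], Array[Double]) = {
  val sorted = entries.filter(_._2 != 0.0).toArray.sortBy(_._1)
  (sorted.map(_._1), sorted.map(_._2))
}

// Example input: entries in arbitrary order, including one explicit zero.
val result = toSparseArrays(Seq(7 -> 2.5, 1 -> 1.0, 4 -> 0.0, 3 -> -0.5))
val indices = result._1
val values = result._2
println(indices.mkString(","))
println(values.mkString(","))

// Assumed wiring in spark-shell (needs MLlib and mahout-math on the classpath;
// k and maxIterations are placeholders you would choose yourself):
//   import scala.collection.JavaConverters._
//   import org.apache.mahout.math.{Vector => MahoutVector}
//   import org.apache.spark.mllib.linalg.Vectors
//   import org.apache.spark.mllib.clustering.KMeans
//   def toMLlib(v: MahoutVector) = {
//     val (idx, vals) = toSparseArrays(v.nonZeroes().asScala.map(e => (e.index(), e.get())))
//     Vectors.sparse(v.size(), idx, vals)
//   }
//   val model = KMeans.train(data.map(toMLlib).cache(), k, maxIterations)
```

The helper pulls the index and value out of each element right away, which matters if the source iterator reuses its element objects.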

@Xiangrui showed a comparison chart some time back.
On May 14, 2014 6:33 AM, "Xiangrui Meng" <men...@gmail.com> wrote:

> You need
>
> > val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
>
> to load the data. After that, you can do
>
> > val data = raw.values.map(_.get)
>
> to get an RDD of Mahout's Vector. You can use `--jars mahout-math.jar`
> when you launch spark-shell to include mahout-math.
>
> Best,
> Xiangrui
>
> On Tue, May 13, 2014 at 10:37 PM, Stuti Awasthi <stutiawas...@hcl.com>
> wrote:
> > Hi All,
> >
> > I am very new to Spark and trying to play around with MLlib, hence
> > apologies for the basic question.
> >
> >
> >
> > I am trying to run the KMeans algorithm using Mahout and Spark MLlib to
> > compare the performance. The initial data size was 10 GB. Mahout converts
> > the data into a Sequence File <Text,VectorWritable>, which is used for
> > KMeans clustering. The Sequence File created was ~6 GB in size.
> >
> >
> >
> > Now I wanted to know whether the Mahout Sequence File can be used in
> > Spark MLlib for KMeans. I have read that SparkContext.sequenceFile may be
> > used here. Hence I tried to read my sequence file as below, but I am
> > getting this error:
> >
> >
> >
> > Command on Spark Shell :
> >
> > scala> val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-00000",String,VectorWritable)
> >
> > <console>:12: error: not found: type VectorWritable
> >
> >        val data = sc.sequenceFile[String,VectorWritable]("/KMeans_dataset_seq/part-r-00000",String,VectorWritable)
> >
> >
> >
> > Here I have 2 questions:
> >
> > 1. Mahout has “Text” as the key, but Spark prints “not found: type Text”,
> > hence I changed it to String. Is this correct?
> >
> > 2. How will VectorWritable be found in Spark? Do I need to include the
> > Mahout jar in the classpath, or is there another option?
> >
> >
> >
> > Please Suggest
> >
> >
> >
> > Regards
> >
> > Stuti Awasthi