You can create TF vectors and then use RowMatrix.computeColumnSummaryStatistics to get the DF (via numNonzeros). For a tokenizer and stemmer, you can use scalanlp/chalk. Yes, it would be worth having a simple interface for this. -Xiangrui
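To make the idea concrete, here is a minimal local sketch of that weighting scheme in plain Scala (no Spark), assuming a pre-tokenized corpus and a fixed vocabulary. The names `termFreq`, `docFreq`, and `tfidf` are illustrative, not Spark APIs; on a cluster you would build the TF vectors as an RDD of MLlib vectors, wrap them in a RowMatrix, and read the per-term DF from computeColumnSummaryStatistics().numNonzeros, which counts exactly the nonzero entries per column as `docFreq` does below.

```scala
object TfIdfSketch {
  // Term frequency: raw counts per document over a fixed vocabulary.
  def termFreq(doc: Seq[String], vocab: Seq[String]): Array[Double] = {
    val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
    vocab.map(t => counts.getOrElse(t, 0.0)).toArray
  }

  // Document frequency per term = number of documents with a nonzero
  // TF entry for that term (what numNonzeros reports per column).
  def docFreq(tfs: Seq[Array[Double]]): Array[Double] =
    tfs.map(_.map(v => if (v > 0) 1.0 else 0.0))
       .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })

  // One common smoothed IDF weighting: tf * log((n + 1) / (df + 1)).
  def tfidf(tf: Array[Double], df: Array[Double], numDocs: Long): Array[Double] =
    tf.zip(df).map { case (t, d) => t * math.log((numDocs + 1.0) / (d + 1.0)) }
}
```

In practice you would store the TF vectors sparsely (Vectors.sparse) since text matrices are mostly zeros, and compute DF once for the whole corpus before weighting each document.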
On Fri, Jun 13, 2014 at 1:21 AM, Stuti Awasthi <stutiawas...@hcl.com> wrote:
> Hi all,
>
> I want to perform text classification using Spark 1.0's Naïve Bayes, and I
> was looking for a way to convert text into sparse vectors with the TF-IDF
> weighting scheme.
>
> I found that the MLI library supports this, but it is only compatible with
> Spark 0.8.
>
> What options are available for text vectorization? Are there any pre-built
> APIs that can be used, or another way to achieve this?
>
> Please suggest.
>
> Thanks,
> Stuti Awasthi