Convert text into tfidf vectors for Classification

2014-06-13 Thread Stuti Awasthi
Hi all,

I want to perform text classification using Naïve Bayes in Spark 1.0. I am
looking for a way to convert text into sparse vectors with a TF-IDF weighting
scheme.
I found that the MLI library supports this, but it is compatible with Spark 0.8.

What options are available for text vectorization? Are there any
pre-built APIs that can be used, or is there another way to achieve this?
Please suggest.

Thanks
Stuti Awasthi






Re: Convert text into tfidf vectors for Classification

2014-06-13 Thread Xiangrui Meng
You can create term-frequency (tf) vectors and then use
RowMatrix.computeColumnSummaryStatistics to get the document frequencies
(df) from numNonzeros. For a tokenizer and stemmer, you can use
scalanlp/chalk. Yes, it is worth having a simple interface for this. -Xiangrui
