Re: Using String Dataset for Logistic Regression

2014-06-03 Thread praveshjain1991
I am not sure. I have just been using some numerical datasets.





Re: Using String Dataset for Logistic Regression

2014-06-03 Thread Xiangrui Meng
Yes. MLlib 1.0 supports sparse input data for linear methods. -Xiangrui
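
For concreteness, a minimal sketch of what sparse input to the 1.0
linear-method API might look like (the sizes, indices, and values below
are made-up toy data, and an existing SparkContext `sc` is assumed):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Vectors.sparse(size, indices, values) stores only the nonzero entries,
// so very high-dimensional feature vectors stay cheap.
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.sparse(10000, Array(3, 42), Array(1.0, 2.0))),
  LabeledPoint(0.0, Vectors.sparse(10000, Array(7, 999), Array(1.0, 1.0)))
))
val model = LogisticRegressionWithSGD.train(points, 20)  // 20 SGD iterations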

On Mon, Jun 2, 2014 at 11:36 PM, praveshjain1991
praveshjain1...@gmail.com wrote:
 I am not sure. I have just been using some numerical datasets.





Re: Using String Dataset for Logistic Regression

2014-06-02 Thread Wush Wu
Dear all,

Does Spark support sparse matrices/vectors for LR now?

Best,
Wush
On 2014/6/2 at 3:19 PM, praveshjain1991 praveshjain1...@gmail.com wrote:

 Thank you for your replies. I've now been using integer datasets but ran
 into another issue:


 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-processing-file-with-particular-number-of-entries-td6694.html

 Any ideas?

 --
 Thanks






Re: Using String Dataset for Logistic Regression

2014-05-16 Thread Brian Gawalt
Pravesh,

Correct: the logistic regression engine is set up to perform classification
tasks. It takes feature vectors (arrays of real-valued numbers), each paired
with a class label, and learns a linear combination of those features that
divides the classes. As the commenters above have mentioned, there are lots
of different ways to turn string data into feature vectors.

For instance, if you're classifying documents as, say, spam or valid
email, you may want to start with a bag-of-words model
(http://en.wikipedia.org/wiki/Bag-of-words_model ) or its rescaled variant
TF-IDF ( http://en.wikipedia.org/wiki/Tf%E2%80%93idf ). You'd turn a single
document into a single, high-dimensional, sparse vector whose element j
encodes the number of appearances of term j. You might also try the
experiment with bigrams, trigrams, and so on.
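
A rough sketch of that kind of featurizer, with a deliberately naive
whitespace tokenizer and a pre-built term-to-index vocabulary (both the
helper name and the vocabulary are illustrative assumptions, not a fixed
API):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Map each in-vocabulary term to its index and count its appearances;
// element j of the resulting sparse vector is the count of term j.
def bagOfWords(doc: String, vocab: Map[String, Int]): Vector = {
  val counts = doc.toLowerCase.split("\\s+")
    .flatMap(vocab.get)            // drop out-of-vocabulary terms
    .groupBy(identity)
    .map { case (idx, hits) => (idx, hits.length.toDouble) }
  Vectors.sparse(vocab.size, counts.toSeq)
}

Each document then becomes one LabeledPoint(label, bagOfWords(text, vocab))
in the training set; swapping raw counts for TF-IDF weights changes only the
values, not the shape.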

Or suppose you're just trying to tell English-language tweets from
non-English-language tweets, in which case bag-of-words might be overkill:
you could instead featurize on just the counts of each pair of consecutive
characters. E.g., the first element counts "aa" appearances, the second
counts "ab", and so on through "zy" and "zz". Those feature vectors will be
smaller and capture less information, but that's probably sufficient for the
simpler task, and you'll be able to fit the model with less data than a
whole-word-based model would need.
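
A sketch of that character-pair featurizer (the helper name is made up;
it assumes lowercase a-z text and silently skips everything else):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// 26*26 = 676 features: element 0 counts "aa", element 1 "ab", ..., 675 "zz".
def charPairCounts(text: String): Vector = {
  val counts = new Array[Double](26 * 26)
  val s = text.toLowerCase
  for (i <- 0 until s.length - 1) {
    val a = s(i) - 'a'
    val b = s(i + 1) - 'a'
    if (a >= 0 && a < 26 && b >= 0 && b < 26)
      counts(26 * a + b) += 1.0
  }
  Vectors.dense(counts)  // 676 dimensions is small enough to keep dense
}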

Different applications are going to need more or less context from your
strings -- whole words? n-grams? just characters? treat them as enums, as in
the days-of-the-week example? -- so it might not make sense for Spark to ship
a direct way to turn a string attribute into a vector for use in logistic
regression. You'll have to settle on the featurization approach that's right
for your domain before you train the logistic regression classifier on your
labelled feature vectors.

Best,
-Brian




