Re: How to deal with string column data for spark mlib?

2016-12-20 Thread big data
I want to use a decision tree to predict whether an event will happen. The data looks like this:

userid   sex      country   age   attr1   attr2   ...   event
1        male     USA       23    xxx                   0
2        male     UK        25    xxx                   1
3        female   JPN       35    xxx                   1
...

I want to use sex, country, age, attr1, attr2, ... as the input features, and the event column as the label for the decision tree.

I understand that in Spark MLlib all feature values have to be doubles, but I do not know how to convert the sex, country, attr1, attr2, ... columns to double directly in a Spark job.
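What I have in mind is roughly something like the following, but I am not sure it is the right approach (just an untested sketch, assuming Spark 2.x ML and the column names from the table above; df is the DataFrame holding that table):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// turn the string-valued categorical columns into numeric indices
val sexIndexer     = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")
val countryIndexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")

// combine the indexed categoricals with the numeric columns into one feature vector
// (attr1, attr2, ... would be added here the same way once they are numeric)
val assembler = new VectorAssembler()
  .setInputCols(Array("sexIndex", "countryIndex", "age"))
  .setOutputCol("features")

// decision tree with the event column as the label
val tree = new DecisionTreeClassifier()
  .setLabelCol("event")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(sexIndexer, countryIndexer, assembler, tree))
val model = pipeline.fit(df)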


thanks.

On 2016-12-20, at 9:37 PM, theodondre wrote:
Give a snippet of the data.



Sent from my T-Mobile 4G LTE Device


-------- Original message --------
From: big data 
Date: 12/20/16 4:35 AM (GMT-05:00)
To: user@spark.apache.org
Subject: How to deal with string column data for spark mlib?

Our source data is string-based, like this:
col1   col2   col3   ...
aaa    bbb    ccc
aa2    bb2    cc2
aa3    bb3    cc3
...    ...    ...

How can I convert all of this data to double so I can apply MLlib's algorithms?

thanks.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



RE: How to deal with string column data for spark mlib?

2016-12-20 Thread theodondre


Give a snippet of the data.


Sent from my T-Mobile 4G LTE Device

-------- Original message --------
From: big data  
Date: 12/20/16  4:35 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: How to deal with string column data for spark mlib? 

Our source data is string-based, like this:
col1   col2   col3   ...
aaa    bbb    ccc
aa2    bb2    cc2
aa3    bb3    cc3
...    ...    ...

How can I convert all of this data to double so I can apply MLlib's algorithms?

thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
@Deepak,
This conversion is not suitable for categorical data. But again, as I mentioned, it all depends on the nature of the data and what the OP intends.

Suppose you want to convert race into numbers (say the races are Black, White, and Asian). You want numerical variables, so you could just assign a number to each race. But if you choose White = 1, Black = 2, Asian = 3, does it really make sense that the distance between Whites and Blacks is exactly half the distance between Whites and Asians? And is that ordering even correct? Probably not.


Instead, what you do is create dummy variables. Let's say you have just those three races. Then you create two dummy variables: White and Black. You could also use White and Asian, or Black and Asian; the key is that you always create one fewer dummy variable than categories. Now the White variable is 1 if the individual is White and 0 otherwise, and the Black variable is 1 if the individual is Black and 0 otherwise. If you now fit a regression model, the coefficient for White tells you the average difference between Asians and Whites (note that the Asian dummy variable was not used, so Asians become the baseline we compare to). The coefficient for Black tells you the average difference between Asians and Blacks.
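In Spark ML terms, this dummy-variable scheme corresponds roughly to StringIndexer followed by OneHotEncoder (rough, untested sketch assuming Spark 2.x; the "race" column name is just from the example above). OneHotEncoder drops the last category by default, so you end up with one fewer dummy column than categories, exactly as described:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// map each race string to a numeric index (0.0, 1.0, 2.0, ...)
val indexer = new StringIndexer()
  .setInputCol("race")
  .setOutputCol("raceIndex")

// expand the index into dummy variables; dropLast = true (the default)
// keeps one fewer dummy column than there are categories
val encoder = new OneHotEncoder()
  .setInputCol("raceIndex")
  .setOutputCol("raceVec")

val indexed = indexer.fit(df).transform(df)
val encoded = encoder.transform(indexed)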

Rohit
On Dec 20, 2016, at 3:15 PM, Deepak Sharma wrote:

You can read the source into a DataFrame, then iterate over the rows with map and use something like below:
df.map(x=>x(0).toString().toDouble)

Thanks
Deepak

On Tue, Dec 20, 2016 at 3:05 PM, big data wrote:
Our source data is string-based, like this:
col1   col2   col3   ...
aaa    bbb    ccc
aa2    bb2    cc2
aa3    bb3    cc3
...    ...    ...

How can I convert all of this data to double so I can apply MLlib's algorithms?

thanks.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net



Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source into a DataFrame, then iterate over the rows with map and use something like below:
df.map(x=>x(0).toString().toDouble)
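
If the strings really are numeric, casting the whole column on the DataFrame also works and avoids the row-level map (rough sketch; "col1" is just the placeholder column name from the example below, and values that don't parse become null):

import org.apache.spark.sql.functions.col

// cast the string column to double in place
val converted = df.withColumn("col1", col("col1").cast("double"))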

Thanks
Deepak

On Tue, Dec 20, 2016 at 3:05 PM, big data  wrote:

> Our source data is string-based, like this:
> col1   col2   col3   ...
> aaa    bbb    ccc
> aa2    bb2    cc2
> aa3    bb3    cc3
> ...    ...    ...
>
> How can I convert all of this data to double so I can apply MLlib's algorithms?
>
> thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
There are various techniques, but the actual answer will depend on what you are trying to do, the kind of input data, and the nature of the algorithm.
You can browse through
https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
which should give you a starting point.
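For plain string columns like the ones below, StringIndexer is often the first thing to try inside Spark ML itself: it learns a numeric index for each distinct string value (rough, untested sketch assuming Spark 2.x; the column names are just the placeholders from the example):

import org.apache.spark.ml.feature.StringIndexer

// one StringIndexer per string column; each learns a mapping value -> index
val indexers = Seq("col1", "col2", "col3").map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "Index")
}
val indexed = indexers.foldLeft(df)((data, indexer) => indexer.fit(data).transform(data))

Whether you then keep the raw indices (fine for tree-based models) or expand them into dummy variables (better for linear models) again depends on the algorithm.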
On Dec 20, 2016, at 3:05 PM, big data wrote:

Our source data is string-based, like this:
col1   col2   col3   ...
aaa    bbb    ccc
aa2    bb2    cc2
aa3    bb3    cc3
...    ...    ...

How can I convert all of this data to double so I can apply MLlib's algorithms?

thanks.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org