Launching multiple spark jobs within a main spark job.

2016-12-20 Thread Naveen
Hi Team,

Is it OK to spawn multiple Spark jobs within a main Spark job? My main
Spark job's driver, which was launched on a YARN cluster, will do some
preprocessing and, based on that, needs to launch multiple Spark jobs on the
YARN cluster. I am not sure if this is the right pattern.

Please share your thoughts.
Sample code I have is below for better understanding.
-

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.SparkContext
import org.apache.spark.launcher.SparkLauncher

object MainSparkJob {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(...)

    // Fetch from Hive using HiveContext
    // Fetch from HBase

    // Spawn multiple Futures, one per child job
    val future1 = Future {
      val sparkJob = new SparkLauncher(...).launch()
      sparkJob.waitFor()
    }

    // Similarly, future2 to futureN

    future1.onComplete { ... }
  }

} // end of MainSparkJob
--


Re: How to get recent value in spark dataframe

2016-12-20 Thread Divya Gehlot
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html

Hope this helps


Thanks,
Divya
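For reference, a minimal Scala sketch of the window-function approach covered by
the link above, using the column names from the question below; df stands for the
data frame in the question, and the exact window specification (reading "recent"
as the latest earlier-dated flag = 1 row) is an assumption, not something stated
in the thread:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// For each id, walk the rows in date order and, for flag = 0 rows, pick the
// price of the closest preceding flag = 1 row.
val w = Window.partitionBy("id").orderBy(col("date"))
  .rowsBetween(Window.unboundedPreceding, -1)  // Spark 2.1+; use Long.MinValue on older versions

val result = df
  .withColumn("last_flag1_price",
    last(when(col("flag") === 1, col("price")), ignoreNulls = true).over(w))
  .withColumn("new_column", when(col("flag") === 0, col("last_flag1_price")))
  .drop("last_flag1_price")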

On 15 December 2016 at 12:49, Milin korath 
wrote:

> Hi
>
> I have a spark data frame with following structure
>
>   id  flag  price  date
>   a    0     100   2015
>   a    0     50    2015
>   a    1     200   2014
>   a    1     300   2013
>   a    0     400   2012
>
> I need to create a data frame with the most recent flag = 1 value filled in
> on the flag = 0 rows.
>
>   id  flag  price  date  new_column
>   a    0     100   2015  200
>   a    0     50    2015  200
>   a    1     200   2014  null
>   a    1     300   2013  null
>   a    0     400   2012  null
>
> We have 2 rows having flag=0. Considering the first row (flag=0), I have 2
> values (200 and 300) and I take the more recent one, 200 (2014). For the last
> row there is no recent value for flag 1, so it is updated with null.
>
> Looking for a solution using Scala. Any help would be appreciated. Thanks
>
> Thanks
> Milin
>


Facing intermittent issue

2016-12-20 Thread Manisha Sethi
Hi All,

I am submitting a few jobs remotely using Spark on YARN / Spark standalone.
Jobs get submitted and run successfully, but all of a sudden it starts throwing
the following exception, and has kept doing so for days on the same cluster:

StackTrace:


Set(); users  with modify permissions: Set(hadoop); groups with modify 
permissions: Set()

16/12/21 12:38:30 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
(the warning above is repeated 16 times)

16/12/21 12:38:30 ERROR SparkContext: Error initializing SparkContext.

java.net.BindException: Cannot assign requested address: Service 'sparkDriver' 
failed after 16 retries! Consider explicitly setting the appropriate port for 
the service 'sparkDriver' (for example spark.ui.port for SparkUI) to an 
available port or increasing spark.port.maxRetries.

at sun.nio.ch.Net.bind0(Native Method)

at sun.nio.ch.Net.bind(Net.java:437)

at sun.nio.ch.Net.bind(Net.java:429)


Can't find anything relevant on Stack Overflow; I tried hostname mapping in
/etc/hosts, but no help. Once this starts happening, the job never runs. I found
many open questions as well, but no concrete solution.

HELP

Manisha
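For reference, a minimal sketch of the configuration route suggested by the error
message itself. The configuration keys are standard Spark settings; the host and
port values are placeholders, not taken from this thread:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RemoteSubmit")
  .set("spark.driver.host", "driver-host.example.com") // a hostname resolvable and bindable on the submitting machine
  .set("spark.driver.port", "45000")                   // pin the driver to a known-open port
  .set("spark.port.maxRetries", "32")                  // or allow more than the default 16 retries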





Re: access Broadcast Variables in Spark java

2016-12-20 Thread Richard Xin
try this:
JavaRDD<String> mapr = listrdd.map(x -> broadcastVar.value().get(x));
 

On Wednesday, December 21, 2016 2:25 PM, Sateesh Karuturi wrote:
 

I need to process Spark broadcast variables using the Java RDD API. This is the
code I have tried so far. This is only sample code to check whether it works or
not; in my case I need to work on two CSV files.
SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMaster("local");
  JavaSparkContext ctx = new JavaSparkContext(conf);
  Map<Integer, String> map = new HashMap<>();
  map.put(1, "aa");
  map.put(2, "bb");
  map.put(9, "ccc");
  Broadcast<Map<Integer, String>> broadcastVar = ctx.broadcast(map);
  List<Integer> list = new ArrayList<>();
  list.add(1);
  list.add(2);
  list.add(9);
  JavaRDD<Integer> listrdd = ctx.parallelize(list);
  JavaRDD<Map<Integer, String>> mapr = listrdd.map(x -> broadcastVar.value());
  System.out.println(mapr.collect());
and it prints output like this:
[{1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}]
and my requirement is: [{aa, bb, ccc}]
Is it possible to do it in my required way? Please help me out.


   

access Broadcast Variables in Spark java

2016-12-20 Thread Sateesh Karuturi
I need to process Spark broadcast variables using the Java RDD API. This is the
code I have tried so far.

This is only sample code to check whether it works or not; in my case I need to
work on two CSV files.


SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMaster("local");
  JavaSparkContext ctx = new JavaSparkContext(conf);
  Map<Integer, String> map = new HashMap<>();
  map.put(1, "aa");
  map.put(2, "bb");
  map.put(9, "ccc");
  Broadcast<Map<Integer, String>> broadcastVar = ctx.broadcast(map);
  List<Integer> list = new ArrayList<>();
  list.add(1);
  list.add(2);
  list.add(9);
  JavaRDD<Integer> listrdd = ctx.parallelize(list);
  JavaRDD<Map<Integer, String>> mapr = listrdd.map(x -> broadcastVar.value());
  System.out.println(mapr.collect());

and it prints output like this:

[{1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}, {1=aa, 2=bb, 9=ccc}]

and my requirement is :

 [{aa, bb, ccc}]

Is it possible to do it in my required way?

Please help me out.


scikit-learn and mllib difference in predictions python

2016-12-20 Thread ioanna
I have an issue with an SVM model trained for binary classification using
Spark 2.0.0.
I have followed the same logic using scikit-learn and MLlib, using the exact
same dataset.
For scikit-learn I have the following code:

svc_model = SVC()
svc_model.fit(X_train, y_train)

print "supposed to be 1"
print svc_model.predict([15 ,15,0,15,15,4,12,8,0,7])
print
svc_model.predict([15.0,15.0,15.0,7.0,7.0,15.0,15.0,0.0,12.0,15.0])
print svc_model.predict([15.0,15.0,7.0,0.0,7.0,0.0,15.0,15.0,15.0,15.0])
print svc_model.predict([7.0,0.0,15.0,15.0,15.0,15.0,7.0,7.0,15.0,15.0])

print "supposed to be 0"
print svc_model.predict([18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0,
15.0, 15.0])
print svc_model.predict([ 11.0,13.0,7.0,10.0,7.0,13.0,7.0,19.0,7.0,7.0])
print svc_model.predict([ 15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0,
7.0, 15.0])
print svc_model.predict([ 15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0,
15.0, 7.0])


and it returns:

supposed to be 1
[0]
[1]
[1]
[1]
supposed to be 0
[0]
[0]
[0]
[0]

For Spark I am doing:

model_svm = SVMWithSGD.train(trainingData, iterations=100)

model_svm.clearThreshold()

print "supposed to be 1"
print
model_svm.predict(Vectors.dense(15.0,15.0,0.0,15.0,15.0,4.0,12.0,8.0,0.0,7.0))
print
model_svm.predict(Vectors.dense(15.0,15.0,15.0,7.0,7.0,15.0,15.0,0.0,12.0,15.0))
print
model_svm.predict(Vectors.dense(15.0,15.0,7.0,0.0,7.0,0.0,15.0,15.0,15.0,15.0))
print
model_svm.predict(Vectors.dense(7.0,0.0,15.0,15.0,15.0,15.0,7.0,7.0,15.0,15.0))
   
print "supposed to be 0"
print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0,
15.0, 15.0, 15.0, 15.0))
print
model_svm.predict(Vectors.dense(11.0,13.0,7.0,10.0,7.0,13.0,7.0,19.0,7.0,7.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0, 15.0,
15.0, 18.0, 7.0, 15.0))
print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0,
15.0, 15.0, 15.0, 7.0))

which returns:

supposed to be 1
12.8250120159
16.0786937313
14.2139435305
16.5115589658
supposed to be 0
17.1311777004
14.075461697
20.8883372052
12.9132580999

When I set the threshold, I either get all zeros or all ones.

Does anyone know how to approach this problem?

As I said I have checked multiple times that my dataset and feature
extraction logic are exactly the same in both cases.
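For context, a minimal sketch of how the threshold interacts with the raw scores
above, written against the Scala MLlib API (the Python API mirrors it);
trainingData stands for the same training RDD as in the code above, and the 15.0
cutoff is an arbitrary example, not a recommendation:

import org.apache.spark.mllib.classification.SVMWithSGD

val model = SVMWithSGD.train(trainingData, 100)
model.clearThreshold()    // predict() now returns raw margins such as 12.8, 16.0, ...
model.setThreshold(15.0)  // predict() now returns 0.0 or 1.0 by comparing the margin to this cutoff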



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/scikit-learn-and-mllib-difference-in-predictions-python-tp28240.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-20 Thread satyajit vegesna
Hi All,

PFB sample code:

val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd


def comp(zipcode:String):Unit={

val zipval = "SELECT * FROM df WHERE
zip_code='$zipvalrepl'".replace("$zipvalrepl",
zipcode)
val data = spark.sql(zipval) //Throwing null pointer exception with RDD
data.write.parquet(..)

}

val sam = zip.map(x => comp(x))
sam.count

But when I do val zip = df.select("zip_code").distinct().as[String].rdd.collect
and call the function, I get the data computed, but in sequential order.

I would like to know why I get a null pointer exception when running map over
the RDD, and whether there is a way to compute the comp function for each
zip code in parallel, i.e. run multiple zip codes at the same time.

Any clue or inputs are appreciated.

Regards.
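For what it's worth, a minimal sketch of one common alternative (an assumption,
not something from the thread): the NullPointerException typically comes from
using the SparkSession inside an executor-side closure, and a per-zip-code output
can instead be produced in a single distributed write; the output path below is a
placeholder.

// Partition the output by zip_code in one job instead of calling spark.sql()
// inside rdd.map, which fails because the session is not available on executors.
df.write.partitionBy("zip_code").parquet("/output/by_zip")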


[no subject]

2016-12-20 Thread satyajit vegesna
Hi All,

PFB sample code:

val df = spark.read.parquet()
df.registerTempTable("df")
val zip = df.select("zip_code").distinct().as[String].rdd


def comp(zipcode:String):Unit={

val zipval = "SELECT * FROM df WHERE
zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
val data = spark.sql(zipval) //Throwing null pointer exception with RDD
data.write.parquet(..)

}

val sam = zip.map(x => comp(x))
sam.count

But when I do val zip = df.select("zip_code").distinct().as[String].rdd.collect
and call the function, I get the data computed, but in sequential order.

I would like to know why I get a null pointer exception when running map over
the RDD, and whether there is a way to compute the comp function for each
zip code in parallel, i.e. run multiple zip codes at the same time.

Any clue or inputs are appreciated.

Regards.


Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
I think limit repartitions your data into a single partition if called as a
non-terminal operator. Hence zip works after limit, because you only have one
partition.

In practice, I have found joins to be much more applicable than zip because of
zip's strict requirement of identical partitioning.

https://richardstartin.com
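A standalone illustration of the zip constraint described above (a sketch only;
sc is a plain SparkContext and none of this is code from the thread):

val a = sc.parallelize(1 to 6, 3)                        // 3 partitions, 2 elements each
val b = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 3)
a.zip(b).collect()                                       // fine: same partition count and element counts

val c = sc.parallelize(1 to 5, 3)                        // same partition count, uneven element counts
// a.zip(c).collect()                                    // throws "Can only zip RDDs with same number of elements in each partition"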

On 20 Dec 2016, at 16:04, Jack Wenger wrote:

Hello,

I'm seeing strange behaviour with Spark 1.5.0 (Cloudera 5.5.1).
I'm loading data from Hive with HiveContext (~42M records) and then trying to add
a new column with "withColumn" and a UDF.
Finally, I'm supposed to create a new Hive table from this dataframe.


Here is the code :

_
_


DATETIME_TO_COMPARE = "-12-31 23:59:59.99"

myFunction = udf(lambda col: 0 if col != DATETIME_TO_COMPARE else 1, 
IntegerType())

df1 = hc.sql("SELECT col1, col2, col3,col4,col5,col6,col7 FROM myTable WHERE 
col4 == someValue")

df2 = df1.withColumn("myNewCol", myFunction(df1.col3))
df2.registerTempTable("df2")

hc.sql("create table my_db.new_table as select * from df2")

_
_


But I get this error :


py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in 
stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 
186, lxpbda25.ra1.intra.groupama.fr): 
org.apache.spark.SparkException: Can only zip RDDs with same number of elements 
in each partition
at 
org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.hasNext(RDD.scala:832)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:104)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:85)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)




What is surprising is that if I modify the select statement by adding a
LIMIT 1 (which is more than twice the number of records in my table), then it works:

_
_

df1 = hc.sql("SELECT col1, col2, col3,col4,col5,col6,col7 FROM myTable WHERE 
col4 == someValue" LIMIT 1)

_
_

In both cases, if I run a count() on df1, I get the same number: 42,593,052.

Is it a bug or am I missing something ?
If it is not a bug, what am I doing wrong ?


Thank you !


Jack


Re: How to deal with string column data for spark mlib?

2016-12-20 Thread big data
I want to use a decision tree to evaluate whether the event will happen; the
data looks like this:

userid  sex     country  age  attr1  attr2  ...  event

1       male    USA      23   xxx               0

2       male    UK       25   xxx               1

3       female  JPN      35   xxx               1

...

I want to use sex, country, age, attr1, attr2, ... as input, and the event column
as the label column for the decision tree.

In Spark MLlib, I understand that all column values should be double to be used
in the computation, but I do not know how to convert the sex, country, attr1,
attr2 columns' values to double type directly in a Spark job.


thanks.
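For what it's worth, a minimal sketch of one way to do this with the Spark ML
(DataFrame-based) API. The column names follow the table above; everything else,
including the use of a Pipeline and the df variable, is an assumption:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Map each categorical string column to a numeric index.
val sexIndexer     = new StringIndexer().setInputCol("sex").setOutputCol("sexIdx")
val countryIndexer = new StringIndexer().setInputCol("country").setOutputCol("countryIdx")

// Assemble the numeric columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("sexIdx", "countryIdx", "age"))
  .setOutputCol("features")

val dt = new DecisionTreeClassifier().setLabelCol("event").setFeaturesCol("features")

val model = new Pipeline()
  .setStages(Array(sexIndexer, countryIndexer, assembler, dt))
  .fit(df)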

On 16/12/20 9:37 PM, theodondre wrote:
Give a snippet of the data.



Sent from my T-Mobile 4G LTE Device


 Original message 
From: big data 
Date: 12/20/16 4:35 AM (GMT-05:00)
To: user@spark.apache.org
Subject: How to deal with string column data for spark mlib?

our source data are string-based data, like this:
col1  col2  col3  ...
aaa   bbb   ccc
aa2   bb2   cc2
aa3   bb3   cc3
...   ...   ...

How to convert all of these data to double to apply for mlib's algorithm?

thanks.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org



RE: How to deal with string column data for spark mlib?

2016-12-20 Thread theodondre


Give a snippet of the data.


Sent from my T-Mobile 4G LTE Device

 Original message 
From: big data  
Date: 12/20/16  4:35 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: How to deal with string column data for spark mlib? 

our source data are string-based data, like this:
col1   col2   col3 ...
aaa   bbb    ccc
aa2   bb2    cc2
aa3   bb3    cc3
... ...   ...

How to convert all of these data to double to apply for mlib's algorithm?

thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Recall: How to deal with string column data for spark mlib?

2016-12-20 Thread Triones,Deng(vip.com)
Deng Gang [Technology Center] would like to recall the message "How to deal with string column data for spark mlib?".


RE: How to deal with string column data for spark mlib?

2016-12-20 Thread Triones,Deng(vip.com)
Hi spark dev,

I am using Spark 2 to write ORC files to HDFS. I have one question about the
save mode.

My use case is this: when I write data into HDFS and one task fails, I want the
file that the task created to be deleted, so that the retried task can write all
of the data. That is to say, if this task holds the data 1 to 100 and the attempt
that writes 1 to 100 fails at first, then after the task scheduler reschedules
the partition task, HDFS should contain only the data 1 to 100, not a duplicated
1 and so on.

If so, which kind of save mode should I use? In FileFormatWriter.scala the file
name rule contains a UUID, so I am confused.


Thanks


Triones



question about the data frame save mode to make the data exactly once

2016-12-20 Thread Triones,Deng(vip.com)

Hi spark dev,

I am using Spark 2 to write ORC files to HDFS. I have one question about the
save mode.

My use case is this: when I write data into HDFS and one task fails, I want the
file that the task created to be deleted, so that the retried task can write all
of the data. That is to say, if this task holds the data 1 to 100 and the attempt
that writes 1 to 100 fails at first, then after the task scheduler reschedules
the partition task, HDFS should contain only the data 1 to 100, not a duplicated
1 and so on.

If so, which kind of save mode should I use? In FileFormatWriter.scala the file
name rule contains a UUID, so I am confused.


Thanks


Triones
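For reference, a minimal sketch of how a save mode is specified on the writer
(the path is a placeholder). SaveMode governs what happens when the target
location already exists for the write as a whole (error, append, overwrite or
ignore); the files written by failed task attempts are handled by the output
commit protocol rather than by the save mode:

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Overwrite).orc("/warehouse/orc_table")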



Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
@Deepak,
This conversion is not suitable for categorical data. But again, as I mentioned,
it all depends on the nature of the data and what the OP intends.

Consider that you want to convert race into numbers (races such as black, white
and asian). You want numerical variables, and you could just assign a number to
each race. But if you choose White=1, Black=2, Asian=3, then does it really make
sense that the distance between Whites and Blacks is exactly half the distance
between Whites and Asians? And is that ordering even correct? Probably not.


Instead, what you do is create dummy variables. Let's say you have just those
three races. Then you create two dummy variables: White and Black. You could also
use White, Asian or Black, Asian; the key is that you always create one fewer
dummy variable than categories. Now, the White variable is 1 if the individual
is white and 0 otherwise, and the Black variable is 1 if the individual is
black and 0 otherwise. If you now fit a regression model, the coefficient
for White tells you the average difference between asians and whites (note that
the Asian dummy variable was not used, so asians become the baseline we compare
to). The coefficient for Black tells you the average difference between asians
and blacks.

Rohit
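A minimal Spark ML sketch of the dummy-variable idea above (Spark 2.x API; the
race column name follows the example, everything else, including df, is an
assumption). Note that OneHotEncoder drops the last category by default, which
matches the "one fewer dummy variable than categories" rule:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// First map the string category to a numeric index, then expand it into dummy variables.
val indexed = new StringIndexer()
  .setInputCol("race").setOutputCol("raceIndex")
  .fit(df).transform(df)

val encoded = new OneHotEncoder()
  .setInputCol("raceIndex").setOutputCol("raceVec")
  .transform(indexed)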
On Dec 20, 2016, at 3:15 PM, Deepak Sharma wrote:

You can read the source in a data frame.
Then iterate over all rows with map and use something like below:
df.map(x=>x(0).toString().toDouble)

Thanks
Deepak

On Tue, Dec 20, 2016 at 3:05 PM, big data wrote:
our source data are string-based data, like this:
col1  col2  col3  ...
aaa   bbb   ccc
aa2   bb2   cc2
aa3   bb3   cc3
...   ...   ...

How to convert all of these data to double to apply for mlib's algorithm?

thanks.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net



Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source in a data frame.
Then iterate over all rows with map and use something like below:
df.map(x=>x(0).toString().toDouble)

Thanks
Deepak

On Tue, Dec 20, 2016 at 3:05 PM, big data  wrote:

> our source data are string-based data, like this:
> col1  col2  col3  ...
> aaa   bbb   ccc
> aa2   bb2   cc2
> aa3   bb3   cc3
> ...   ...   ...
>
> How to convert all of these data to double to apply for mlib's algorithm?
>
> thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
There are various techniques, but the actual answer will depend on what you are
trying to do, the kind of input data, and the nature of the algorithm.
You can browse through 
https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
this should give you a starting hint.
On Dec 20, 2016, at 3:05 PM, big data wrote:

our source data are string-based data, like this:
col1  col2  col3  ...
aaa   bbb   ccc
aa2   bb2   cc2
aa3   bb3   cc3
...   ...   ...

How to convert all of these data to double to apply for mlib's algorithm?

thanks.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




How to deal with string column data for spark mlib?

2016-12-20 Thread big data
our source data are string-based data, like this:
col1  col2  col3  ...
aaa   bbb   ccc
aa2   bb2   cc2
aa3   bb3   cc3
...   ...   ...

How to convert all of these data to double to apply for mlib's algorithm?

thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org