Launching multiple spark jobs within a main spark job.

2016-12-20 Thread Naveen
Hi Team, Is it OK to spawn multiple Spark jobs within a main Spark job? My main Spark job's driver, which was launched on a YARN cluster, will do some preprocessing and, based on the results, needs to launch multiple Spark jobs on the YARN cluster. Not sure if this is the right pattern. Please share your thoughts.
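A minimal sketch of one supported way to do this, using SparkLauncher to start child applications from the driver; the jar path and main class below are hypothetical:

    import org.apache.spark.launcher.SparkLauncher

    // Launch a child Spark application on YARN from inside the driver.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/child-job.jar")    // hypothetical jar
      .setMainClass("com.example.ChildJob")        // hypothetical class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication()
    // handle.getState can be polled to wait for the child job to finish.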

Re: How to get recent value in spark dataframe

2016-12-20 Thread Divya Gehlot
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
Hope this helps. Thanks, Divya

On 15 December 2016 at 12:49, Milin korath wrote:
> Hi
> I have a spark data frame with following structure
> id flag price date
> a 0
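For the question as asked, a window function is the usual approach. A minimal sketch, assuming a DataFrame df with the id and date columns from the quoted snippet:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Keep only the most recent row per id, ordering by date descending.
    val w = Window.partitionBy("id").orderBy(col("date").desc)
    val latest = df.withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")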

Facing intermittent issue

2016-12-20 Thread Manisha Sethi
Hi All, I am submitting a few jobs remotely using Spark on YARN / Spark standalone. Jobs get submitted and run successfully, but all of a sudden the same cluster starts throwing exceptions for days. StackTrace: Set(); users with modify permissions: Set(hadoop); groups with modify permissions:

Re: access Broadcast Variables in Spark java

2016-12-20 Thread Richard Xin
Try this:

    JavaRDD mapr = listrdd.map(x -> broadcastVar.value().get(x));

On Wednesday, December 21, 2016 2:25 PM, Sateesh Karuturi wrote:
> I need to process spark Broadcast variables using Java RDD API. This is my code what i have tried so far: This is only
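For context, a self-contained sketch of the lookup-via-broadcast pattern Richard describes (shown in Scala for brevity; the Java form is the one-liner above, and the data here is made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("BroadcastVariable")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Ship a small lookup table to every executor once.
    val lookup = sc.broadcast(Map(0 -> "zero", 1 -> "one", 2 -> "two"))
    val mapped = sc.parallelize(Seq(0, 1, 2)).map(x => lookup.value(x))
    mapped.collect().foreach(println)   // zero, one, two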

access Broadcast Variables in Spark java

2016-12-20 Thread Sateesh Karuturi
I need to process Spark broadcast variables using the Java RDD API. This is the code I have tried so far; it is only sample code to check whether it works. In my case I need to work on two CSV files.

    SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMaster("local");

scikit-learn and mllib difference in predictions python

2016-12-20 Thread ioanna
I have an issue with an SVM model trained for binary classification using Spark 2.0.0. I have followed the same logic using scikit-learn and MLlib, using the exact same dataset. For scikit-learn I have the following code:

    svc_model = SVC()
    svc_model.fit(X_train, y_train)
    print
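One likely source of the divergence, worth checking: scikit-learn's SVC() defaults to an RBF kernel, while MLlib's SVM is linear, so the two models are not comparable out of the box (SVC(kernel='linear') is the like-for-like baseline). A minimal sketch of the MLlib side, with made-up toy data standing in for X_train / y_train:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Toy training data in MLlib's LabeledPoint format.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(2.0, 0.5))
    ))
    val model = SVMWithSGD.train(training, 100)   // linear SVM, 100 iterations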

Null pointer exception with RDD while computing a method, creating dataframe.

2016-12-20 Thread satyajit vegesna
Hi All, PFB sample code:

    val df = spark.read.parquet()
    df.registerTempTable("df")
    val zip = df.select("zip_code").distinct().as[String].rdd

    def comp(zipcode: String): Unit = {
      val zipval = "SELECT * FROM df WHERE zip_code='$zipvalrepl'".replace("$zipvalrepl", zipcode)
      val data =
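If comp is being invoked inside an RDD transformation (e.g. zip.map(comp)), that is the usual cause of this NullPointerException: SparkSession, DataFrames, and spark.sql can only be used on the driver, and are null inside executor-side closures. A hedged sketch of the common fix, collecting the (presumably small) set of distinct zip codes to the driver first:

    import spark.implicits._

    val zipCodes = df.select("zip_code").distinct().as[String].collect()
    zipCodes.foreach { zc =>
      // Runs on the driver, so spark.sql is available here.
      val data = spark.sql(s"SELECT * FROM df WHERE zip_code = '$zc'")
      // ... process data ...
    }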

Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
I think limit repartitions your data into a single partition if called as a non-terminal operator. Hence zip works after limit because you only have one partition. In practice, I have found joins to be much more applicable than zip because of the strict limitation of identical partitions.
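A sketch of the join-based alternative, which pairs elements by position without zip's requirement that both RDDs have identical partitioning; the toy data is made up:

    // Two RDDs with different partition counts; plain zip would fail here.
    val rddA = sc.parallelize(Seq("a", "b", "c"), 3)
    val rddB = sc.parallelize(Seq(1, 2, 3), 2)

    val paired = rddA.zipWithIndex.map(_.swap)
      .join(rddB.zipWithIndex.map(_.swap))
      .sortByKey()
      .values
    // paired: ("a",1), ("b",2), ("c",3)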

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread big data
I want to use a decision tree to evaluate whether the event will happen. The data look like this:

    userid  sex   country  age  attr1  attr2  ...  event
    1       male  USA      23   xxx               0
    2       male  UK       25   xxx               1
    3
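A minimal sketch of one standard approach in spark.ml: index the categorical string columns, assemble a feature vector, and fit a decision tree. Column names follow the question; df and the exact column set are assumptions, age is assumed numeric (cast it first if it is a string), and event is assumed to already be a numeric (double) label:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Map each string category to a numeric index.
    val sexIdx = new StringIndexer().setInputCol("sex").setOutputCol("sexIdx")
    val countryIdx = new StringIndexer().setInputCol("country").setOutputCol("countryIdx")

    // Combine numeric columns into the single features vector the model expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("sexIdx", "countryIdx", "age"))
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("event")
      .setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(sexIdx, countryIdx, assembler, dt))
      .fit(df)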

RE: How to deal with string column data for spark mlib?

2016-12-20 Thread theodondre
Give a snippet of the data.

Recall: How to deal with string column data for spark mlib?

2016-12-20 Thread Triones,Deng(vip.com)
Deng Gang [Technology Center] has recalled the message "How to deal with string column data for spark mlib?".

question about the data frame save mode to make the data exactly once

2016-12-20 Thread Triones,Deng(vip.com)
Hi Spark dev, I am using Spark 2 to write ORC files to HDFS. I have a question about the save mode. My use case is this: when I write data into HDFS, if one task fails, I would like the file that the task created to be deleted so that the retried task can write all the data, that is to
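Two separate mechanisms are in play here, so a brief note plus a sketch. SaveMode only governs what happens when the target path already exists; per-task atomicity is normally handled by the Hadoop output committer, which stages each task attempt's files under a _temporary directory and only moves them into place when the task commits, so a failed attempt's partial files should not become visible. A minimal write sketch (the output path is hypothetical):

    import org.apache.spark.sql.SaveMode

    df.write
      .mode(SaveMode.Overwrite)        // Append / ErrorIfExists / Ignore also exist
      .orc("hdfs:///path/to/output")   // hypothetical path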

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
@Deepak, this conversion is not suitable for categorical data. But again, as I mentioned, it all depends on the nature of the data and what the OP intends. Consider that you want to convert race into numbers (races such as black, white, and asian). So you want numerical variables, and you could just assign a
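A sketch of how spark.ml usually handles exactly this race example: StringIndexer assigns each category an index, and OneHotEncoder then turns the index into a binary vector so the model does not read a spurious ordering into the numbers. The toy data is made up:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val df = spark.createDataFrame(Seq(
      (0, "black"), (1, "white"), (2, "asian"), (3, "white")
    )).toDF("id", "race")

    // race -> numeric index (ordered by category frequency).
    val indexed = new StringIndexer()
      .setInputCol("race").setOutputCol("raceIndex")
      .fit(df).transform(df)

    // index -> one-hot vector, avoiding an implied order between races.
    val encoded = new OneHotEncoder()
      .setInputCol("raceIndex").setOutputCol("raceVec")
      .transform(indexed)
    encoded.show()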

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source into a data frame, then iterate over all rows with map and use something like below:

    df.map(x => x(0).toString.toDouble)

Thanks, Deepak

On Tue, Dec 20, 2016 at 3:05 PM, big data wrote:
> our source data are string-based data, like this:
> col1

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Rohit Verma
There are various techniques, but the actual answer will depend on what you are trying to do, the kind of input data, and the nature of the algorithm. You can browse through https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/; this should give you a starting point.

How to deal with string column data for spark mlib?

2016-12-20 Thread big data
Our source data are string-based, like this:

    col1  col2  col3  ...
    aaa   bbb   ccc
    aa2   bb2   cc2
    aa3   bb3   cc3
    ...   ...   ...

How can we convert all of these data to double to apply MLlib's algorithms? Thanks.