Re: SparkML algos limitations question.

2015-12-27 Thread Yanbo Liang
Hi Eugene, AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as >10M), although I think the current limit already covers many use cases. Yanbo 2015-12-16 6:00 GMT+08:00 Joseph Bradley : > Hi
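[Editor's note] For context, a minimal sketch of setting up the classifier whose scaling is being discussed (Spark ML 1.5+ API; the layer sizes and the trainingDF DataFrame are illustrative assumptions, not from the thread):

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    // Total weight count grows roughly with the products of adjacent layer
    // sizes, which is what drives the model-size limits mentioned above.
    val layers = Array[Int](784, 100, 50, 10)
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setMaxIter(100)
    // val model = mlp.fit(trainingDF)  // trainingDF: DataFrame with "features"/"label" columns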

Re: partitioning json data in spark

2015-12-27 Thread Igor Berman
Have you tried specifying the format of your output? Parquet might be the default format. df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path"); On 27 December 2015 at 15:18, Նարեկ Գալստեան wrote: > Hey all! > I am willing to partition *json *data by a column
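[Editor's note] A Scala sketch of the same idea combined with partitionBy, assuming Spark 1.5+ (where partitioned writes are not limited to Parquet), an existing DataFrame df, and an illustrative "country" column:

    import org.apache.spark.sql.SaveMode

    // Writes one sub-directory per distinct value of "country", each holding
    // plain JSON part files instead of Parquet.
    df.write
      .format("json")
      .partitionBy("country")
      .mode(SaveMode.Overwrite)
      .save("/tmp/json_out")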

Re: Pattern type is incompatible with expected type

2015-12-27 Thread Ted Yu
Have you tried declaring RDD[ChildTypeOne] and writing separate functions for each sub-type ? Cheers On Sun, Dec 27, 2015 at 10:08 AM, pkhamutou wrote: > Hello, > > I have a such situation: > > abstract class SuperType {...} > case class ChildTypeOne(x: String) extends
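[Editor's note] A sketch of that suggestion, keeping each RDD typed to its concrete subclass (handler names are illustrative; class definitions follow the original post):

    import org.apache.spark.rdd.RDD

    // One small, statically typed handler per subclass avoids matching on the
    // RDD's type parameter altogether.
    def handleOne(rdd: RDD[ChildTypeOne]): Unit = rdd.foreach(c => println(c.x))
    def handleTwo(rdd: RDD[ChildTypeTwo]): Unit = rdd.foreach(c => println(c.x))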

Passing parameters to spark SQL

2015-12-27 Thread Ajaxx
Given a SQLContext (or HiveContext), is it possible to pass parameters in to a query? There are several reasons why this makes sense, including loss of data types during conversion to strings, SQL injection, etc. But currently, it appears that SQLContext.sql() only takes a single parameter, which is

Pattern type is incompatible with expected type

2015-12-27 Thread pkhamutou
Hello, I have the following situation: abstract class SuperType {...} case class ChildTypeOne(x: String) extends SuperType {.} case class ChildTypeTwo(x: String) extends SuperType {} then I have: val rdd1: RDD[SuperType] = sc./*some code*/.map(r => ChildTypeOne(r)) val rdd2: RDD[SuperType] =
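[Editor's note] A minimal reconstruction of why the compiler rejects this (class names follow the post; the commented-out case is the kind of pattern that triggers the error):

    import org.apache.spark.rdd.RDD

    abstract class SuperType
    case class ChildTypeOne(x: String) extends SuperType
    case class ChildTypeTwo(x: String) extends SuperType

    // RDD is invariant in its type parameter, so RDD[ChildTypeOne] is not a
    // subtype of RDD[SuperType], and the element type is erased at runtime.
    def process(rdd: RDD[SuperType]): Unit = rdd match {
      // case r: RDD[ChildTypeOne] => ...  // "pattern type is incompatible with expected type"
      case _ => ()
    }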

Re: partitioning json data in spark

2015-12-27 Thread Ted Yu
Is upgrading to 1.5.x a possibility for you ? Cheers On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան wrote: > > http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter > I did try but it all was in vain. > It is also explicitly

Re: partitioning json data in spark

2015-12-27 Thread Նարեկ Գալստեան
http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter I did try, but it was all in vain. It is also explicitly written in the API docs that it only supports Parquet. Narek Galstyan Նարեկ Գալստյան On 27 December 2015 at 17:52, Igor Berman

Re: Pattern type is incompatible with expected type

2015-12-27 Thread Pavel Khamutou
But the idea is to keep it as RDD[SuperType], since I have an implicit conversion to add custom functionality to the RDD. Like here: http://blog.madhukaraphatak.com/extending-spark-api/ Cheers On 27 December 2015 at 19:13, Ted Yu wrote: > Have you tried declaring RDD[ChildTypeOne]
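[Editor's note] A minimal sketch of that pattern (the method name is illustrative), which keeps the static type RDD[SuperType] and matches on the elements instead of on the RDD's type parameter:

    import org.apache.spark.rdd.RDD

    object SuperTypeFunctions {
      // Pimp-my-library style: adds a custom method to every RDD[SuperType].
      implicit class RichSuperTypeRDD(rdd: RDD[SuperType]) {
        def countChildTypeOne(): Long = rdd.filter {
          case _: ChildTypeOne => true
          case _               => false
        }.count()
      }
    }
    // Usage: import SuperTypeFunctions._ ; rdd1.countChildTypeOne()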

Re: Passing parameters to spark SQL

2015-12-27 Thread Jeff Zhang
You can do it using Scala string interpolation http://docs.scala-lang.org/overviews/core/string-interpolation.html On Mon, Dec 28, 2015 at 5:11 AM, Ajaxx wrote: > Given a SQLContext (or HiveContext) is it possible to pass in parameters > to a > query. There are several
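[Editor's note] A sketch of the interpolation approach, assuming an existing sqlContext and a registered temp table named "people". Note that plain interpolation does not escape values, so it is not a substitute for real bind parameters where SQL injection matters:

    val minAge = 21
    val result = sqlContext.sql(s"SELECT name, age FROM people WHERE age >= $minAge")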

Re: How to contribute by picking up starter bugs

2015-12-27 Thread lokeshkumar
Thanks a lot Jim, looking forward to picking up some bugs. On Mon, Dec 28, 2015 at 8:42 AM, jiml [via Apache Spark User List] < ml-node+s1001560n25813...@n3.nabble.com> wrote: > You probably want to start on the dev list: > http://apache-spark-developers-list.1001551.n3.nabble.com/ > > I have

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
Yes, sharing the execution flow: 15/12/28 00:19:15 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/12/28 00:19:15 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala> import

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Jeff Zhang
See http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation On Mon, Dec 28, 2015 at 2:00 PM, 顾亮亮 wrote: > Hi all, > > > > SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a useful > feature to save resources on yarn. > > We
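[Editor's note] A sketch of the settings that page describes (values are illustrative, and the external shuffle service must also be running on each NodeManager):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")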

DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
Hi, I am trying to join two dataframes and am able to display the results in the console after the join. I am saving the joined data in CSV format using the spark-csv API, but it is saving just the column names, not the data at all. Below is the sample code for reference: spark-shell
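[Editor's note] For reference, a sketch of a spark-csv write of a joined DataFrame (joineddf is the name used later in this thread; the path and option values are illustrative):

    joineddf.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .mode("overwrite")
      .save("/tmp/joined_csv")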

Re: Stuck with DataFrame df.select("select * from table");

2015-12-27 Thread Gourav Sengupta
Shouldn't df.select just take the column names, and sqlC.sql take the select statement? Therefore perhaps we could use: df.select("COLUMN1", "COLUMN2") and sqlC.sql("select COLUMN1, COLUMN2 from tablename") Why would someone want to do a select on a dataframe after registering it as a table? I
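[Editor's note] The two styles being contrasted, side by side (column and table names are illustrative, and an existing sqlContext is assumed):

    val viaApi = df.select("COLUMN1", "COLUMN2")                           // DataFrame API

    df.registerTempTable("tablename")
    val viaSql = sqlContext.sql("select COLUMN1, COLUMN2 from tablename")  // SQL on a registered table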

Help: Driver OOM when shuffle large amount of data

2015-12-27 Thread kendal
My driver is running OOM with my 4T data set... I don't collect any data to the driver. All the program does is map - reduce - saveAsTextFile. But the number of partitions to be shuffled is quite large - 20K+. The symptom I'm seeing is a timeout on GetMapOutputStatuses from the driver. 15/12/24 02:04:21
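[Editor's note] Not a definitive fix, but a sketch of the knobs commonly tried for this symptom (values are illustrative): the driver has to serve MapOutputStatuses for every shuffle, and that metadata grows with the partition count, so more driver heap and/or fewer partitions usually helps:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // driver memory must be set before the driver JVM starts
      // (spark-submit --driver-memory or spark-defaults.conf)
      .set("spark.driver.memory", "8g")
      .set("spark.default.parallelism", "5000")  // fewer shuffle partitions => smaller map status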

Re: why one of Stage is into Skipped section instead of Completed

2015-12-27 Thread Prem Spark
Thank you Silvio for the update. On Sat, Dec 26, 2015 at 1:14 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Skipped stages result from existing shuffle output of a stage when > re-running a transformation. The executors will have the output of the > stage in their local dirs and

DataFrame Vs RDDs ... Which one to use When ?

2015-12-27 Thread Divya Gehlot
Hi, I am a newbie to Spark and a bit confused about RDDs and DataFrames in Spark. Can somebody explain, with use cases, which one to use when? Would really appreciate the clarification. Thanks, Divya

Re: Passing parameters to spark SQL

2015-12-27 Thread Michael Armbrust
The only way to do this for SQL is through the JDBC driver. However, you can use literal values without lossy/unsafe string conversions by using the DataFrame API. For example, to filter: import org.apache.spark.sql.functions._ df.filter($"columnName" === lit(value)) On Sun, Dec 27, 2015 at

Inconsistent behavior of randomSplit in YARN mode

2015-12-27 Thread Gaurav Kumar
Hi, I noticed an inconsistent behavior when using rdd.randomSplit when the source rdd is repartitioned, but only in YARN mode. It works fine in local mode though. *Code:* val rdd = sc.parallelize(1 to 100) val rdd2 = rdd.repartition(64) rdd.partitions.size rdd2.partitions.size val
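[Editor's note] A runnable sketch of that pattern with an explicit seed (weights and seed are illustrative). Since randomSplit samples each element in partition order, any nondeterminism in how repartition lays out the data can make the splits differ between runs or modes:

    val rdd  = sc.parallelize(1 to 100)
    val rdd2 = rdd.repartition(64)
    val Array(train, test) = rdd2.randomSplit(Array(0.8, 0.2), seed = 42L)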

Can anyone explain Spark behavior for below? Kudos in Advance

2015-12-27 Thread Prem Spark
Scenario1: val z = sc.parallelize(List("12","23","345",""),2) z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y) res143: String = 10 Scenario2: val z = sc.parallelize(List("12","23","","345"),2) z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) =>

Re: Can anyone explain Spark behavior for below? Kudos in Advance

2015-12-27 Thread Jeff Zhang
Not sure what you're trying to do, but the result is correct. Scenario 2: Partition 1 holds ("12", "23"): ("","12") => "0", then ("0","23") => "1". Partition 2 holds ("","345"): ("","") => "0", then ("0","345") => "1". Final merge: ("1","1") => "11". On Mon, Dec 28, 2015 at 7:14 AM, Prem Spark
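[Editor's note] Both scenarios from the question, runnable in spark-shell as a sketch: the seqOp turns every step into "0" or "1" (math.min(...).toString), and the combOp concatenates the per-partition results, so the answer depends on how the four strings land in the two partitions (per the thread, these evaluate to "10" and "11"):

    val z1 = sc.parallelize(List("12", "23", "345", ""), 2)
    val r1 = z1.aggregate("")((x, y) => math.min(x.length, y.length).toString,
                              (x, y) => x + y)

    val z2 = sc.parallelize(List("12", "23", "", "345"), 2)
    val r2 = z2.aggregate("")((x, y) => math.min(x.length, y.length).toString,
                              (x, y) => x + y)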

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Ted Yu
Can you confirm that file1df("COLUMN2") and file2df("COLUMN10") appeared in the output of joineddf.collect.foreach(println) ? Thanks On Sun, Dec 27, 2015 at 6:32 PM, Divya Gehlot wrote: > Hi, > I am trying to join two dataframes and able to display the results in the

Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread 顾亮亮
Hi all, SPARK-3174 (https://issues.apache.org/jira/browse/SPARK-3174) is a useful feature to save resources on YARN. We want to enable this feature on our YARN cluster. I have a question about the version of the shuffle service. I’m now using spark-1.5.1 (shuffle service). If I want to upgrade to

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
External shuffle service is backward compatible, so if you deployed 1.6 shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications. Thanks Saisai On Mon, Dec 28, 2015 at 2:33 PM, 顾亮亮 wrote: > Is it possible to support both spark-1.5.1 and spark-1.6.0 on

RE: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread 顾亮亮
Is it possible to support both spark-1.5.1 and spark-1.6.0 on one yarn cluster? From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: Monday, December 28, 2015 2:29 PM To: Jeff Zhang Cc: 顾亮亮; user@spark.apache.org; 刘骋昺 Subject: Re: Opening Dynamic Scaling Executors on Yarn Replace all the

Re: Pattern type is incompatible with expected type

2015-12-27 Thread pkhamutou
Thank you for your response! But this approach did not help. This one works: def check[T: ClassTag](rdd: List[T]) = rdd match { case rdd: List[Int] if classTag[T] == classTag[Int] => println("Int") case rdd: List[String] if classTag[T] == classTag[String] => println("String")
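[Editor's note] A sketch of the same ClassTag trick adapted from List to RDD (class names follow the original post). Note that this only helps when the RDD's static type parameter is the concrete child type; an RDD[SuperType] will report SuperType:

    import scala.reflect.{ClassTag, classTag}
    import org.apache.spark.rdd.RDD

    def check[T: ClassTag](rdd: RDD[T]): Unit =
      if (classTag[T] == classTag[ChildTypeOne]) println("ChildTypeOne")
      else if (classTag[T] == classTag[ChildTypeTwo]) println("ChildTypeTwo")
      else println("unknown element type")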

Re: DataFrame Save is writing just column names while saving

2015-12-27 Thread Divya Gehlot
Finally able to resolve the issue. For the sample example with a small dataset, it is creating some 200 files. I was just doing a random file check in the output directory and, alas, was getting files with only column names. Attaching the output files now. Now another question arises: why so many (200 output files)
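[Editor's note] The ~200 files come from spark.sql.shuffle.partitions, which defaults to 200 and sets the number of post-join partitions; each partition is written as its own part file. A sketch of one common way to get fewer files for a small result (coalescing to 1 is illustrative):

    joineddf.coalesce(1)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/joined_csv_single")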

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-27 Thread Saisai Shao
Replacing all the shuffle jars and restarting the NodeManager is enough; no need to restart the NN. On Mon, Dec 28, 2015 at 2:05 PM, Jeff Zhang wrote: > See > http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation > > > > On Mon, Dec 28, 2015 at 2:00 PM,

partitioning json data in spark

2015-12-27 Thread Նարեկ Գալստեան
Hey all! I would like to partition *json* data by a column name and store the result as a collection of json files to be loaded into another database. I could use Spark's built-in *partitionBy* function, but it only outputs in Parquet format, which is not desirable for me. Could you suggest a way