Re: Aggregated column name

2017-03-23 Thread Wen Pei Yu
Thanks, Kevin. This works for aggregating one or two columns, but it does not work for this: val expr = (Map("forCount" -> "count") ++ features.map((_ -> "mean"))) val averageDF = originalDF .withColumn("forCount", lit(0)) .groupBy(col("...")) .agg(expr) Yu Wenpei. From: Kevin Mellott

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
Ryan, you are right. That was the issue. It works now. Thanks. On Thu, Mar 23, 2017 at 8:26 PM, Ryan wrote: > you should import either spark.implicits or sqlContext.implicits, not > both. Otherwise the compiler will be confused about two implicit > transformations > >

Re: How to load "kafka" as a data source

2017-03-23 Thread Deepu Raj
Please check that the tool versions are the same throughout. Thanks Deepu On Fri, 24 Mar 2017 14:47:11 +1100, Gaurav1809 wrote: > Hi All, > > I am running a simple command on spark-shell - like this. It's a piece of > structured streaming. > > val lines = (spark >.readStream >

How to load "kafka" as a data source

2017-03-23 Thread Gaurav1809
Hi All, I am running a simple command in spark-shell, like this; it's a piece of Structured Streaming. val lines = (spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "localhost:9092") .option("subscribe", "test") .load() .selectExpr("CAST(value AS STRING)")
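
For reference, a minimal working sketch of this kind of pipeline, assuming spark-shell was started with the Kafka source on the classpath (e.g. --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0); the topic name and the console sink are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaSourceSketch").getOrCreate()

// Read the "test" topic as a streaming DataFrame and keep only the message value as a string.
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()
  .selectExpr("CAST(value AS STRING)")

// Write the stream to the console just to confirm the source is wired up correctly.
val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```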

Re: Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Ryan
No, you don't need one-hot encoding. But since the features column is a vector, and a vector only accepts numbers, if your feature is a string then a StringIndexer is needed. Here's an example: http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier On Thu, Mar 23, 2017 at
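
A minimal sketch of that combination, assuming a DataFrame with a string column "color", a numeric column "someNumericCol", and a "label" column (all names are illustrative):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Map the string column to numeric category indices; trees do not need one-hot encoding.
val colorIndexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("colorIndex")

// Assemble the indexed column and any numeric columns into the features vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("colorIndex", "someNumericCol"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(colorIndexer, assembler, rf))
// val model = pipeline.fit(trainingDF)   // trainingDF is assumed to exist
```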

Re: Converting dataframe to dataset question

2017-03-23 Thread Ryan
You should import either spark.implicits or sqlContext.implicits, not both; otherwise the compiler will be confused about two implicit transformations. The following code works for me, Spark version 2.1.0: object Test { def main(args: Array[String]) { val spark = SparkSession .builder
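
A completed sketch of that pattern (the Teamuser case class and the field values are placeholders taken from the thread):

```scala
import org.apache.spark.sql.SparkSession

// The case class lives at the top level, outside the object that uses it.
case class Teamuser(teamid: String, userid: String, role: String)

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ImplicitsExample")
      .master("local[*]")
      .getOrCreate()

    // Import one set of implicits only; spark.implicits._ is enough on Spark 2.x.
    import spark.implicits._

    val ds = Seq(Teamuser("t1", "u1", "r1")).toDS()
    ds.show()

    spark.stop()
  }
}
```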

Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread shyla deshpande
Thanks a million Yong. Great help!!! It solved my problem. On Thu, Mar 23, 2017 at 6:00 PM, Yong Zhang wrote: > Change: > > val arrayinput = input.getAs[Array[String]](0) > > to: > > val arrayinput = input.getAs[*Seq*[String]](0) > > > Yong > > >

Re: Aggregated column name

2017-03-23 Thread Kevin Mellott
I'm not sure of the answer to your question; however, when performing aggregates I find it useful to specify an *alias* for each column. That will give you explicit control over the name of the resulting column. In your example, that would look something like:
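
Along these lines (a sketch; df and the grouping column placeholder come from the earlier examples, and "someFeature" is illustrative):

```scala
import org.apache.spark.sql.functions._

// Aliasing each aggregate gives a stable column name regardless of Spark version.
val aggregated = df
  .groupBy(col("..."))          // same grouping column placeholder as above
  .agg(
    count("number").alias("number_count"),
    avg("someFeature").alias("someFeature_mean")
  )
```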

Re: LDA in Spark

2017-03-23 Thread Joseph Bradley
Hi Mathieu, I'm CCing the Spark user list since this will be of general interest to the forum. Unfortunately, there is not a way to begin LDA training with an existing model currently. Some MLlib models have been augmented to support specifying an "initialModel" argument, but LDA does not have

Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread Yong Zhang
Change: val arrayinput = input.getAs[Array[String]](0) to: val arrayinput = input.getAs[Seq[String]](0) Yong From: shyla deshpande Sent: Thursday, March 23, 2017 8:18 PM To: user Subject: Spark dataframe,
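
For context, Spark hands array columns to a UDAF as a WrappedArray (a Seq), not a Java array, which is why the getAs[Seq[String]] version works. A sketch of a complete UDAF along the lines described in the thread (column names and buffer layout are illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Counts how often each goal string occurs in an array<string> input column.
class GoalCounter extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("goals", ArrayType(StringType)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("counts", MapType(StringType, IntegerType)) :: Nil)
  def dataType: DataType = MapType(StringType, IntegerType)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Map.empty[String, Int]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      // WrappedArray is a Seq, so getAs[Seq[String]] succeeds where Array[String] fails.
      val goals = input.getAs[Seq[String]](0)
      val counts = buffer.getAs[Map[String, Int]](0)
      buffer(0) = goals.foldLeft(counts)((m, g) => m + (g -> (m.getOrElse(g, 0) + 1)))
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val m1 = buffer1.getAs[Map[String, Int]](0)
    val m2 = buffer2.getAs[Map[String, Int]](0)
    buffer1(0) = m2.foldLeft(m1)((m, kv) => m + (kv._1 -> (m.getOrElse(kv._1, 0) + kv._2)))
  }

  def evaluate(buffer: Row): Any = buffer.getAs[Map[String, Int]](0)
}
```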

Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread shyla deshpande
This is my input data. The UDAF needs to aggregate the goals for a team and return a map that gives the count for every goal in the team. I am getting the following error java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String; at

Re: Persist RDD doubt

2017-03-23 Thread sjayatheertha
Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it. > On Mar 23, 2017, at 4:11 AM, nayan sharma wrote: > > In case of task failures, does Spark clear the
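
A tiny sketch of that behaviour (assumes a spark-shell sc and a placeholder input path):

```scala
import org.apache.spark.storage.StorageLevel

// Persist the RDD serialized in memory; any partition lost later is rebuilt from lineage.
val parsed = sc.textFile("hdfs:///some/input/path")   // placeholder path
  .map(_.split(","))
  .persist(StorageLevel.MEMORY_ONLY_SER)

parsed.count()   // first action materializes and caches the partitions
parsed.count()   // reuses the cache; recomputed partitions replace, they are not appended
```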

Re: how to read object field within json file

2017-03-23 Thread Yong Zhang
That's why your "source" should be defined as an Array[Struct] type (which makes sense in this case, since it has an undetermined length), so you can explode it and get the description easily. As it is now, you need to write your own UDF, which maybe can do what you want. Yong From:
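
A sketch of the explode approach being described, assuming the JSON were reshaped so that "source" is an array of structs (the file name is a placeholder and the field names mirror the sample in the original question):

```scala
import org.apache.spark.sql.functions._

// Assumes input shaped like:
// {"id":"test1","source":[{"id":"4970","eId":"F1","description":"test1"}, ...]}
val df = spark.read.json("source.json")   // placeholder path

val descriptions = df
  .select(col("id"), explode(col("source")).as("src"))
  .select(col("id"), col("src.eId"), col("src.description"))

descriptions.show()
```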

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
I made the code even simpler and am still getting the error: value toDF is not a member of Seq[com.whil.batch.Teamuser] [ERROR] val df = Seq(Teamuser("t1","u1","r1")).toDF() object Test { def main(args: Array[String]) { val spark = SparkSession .builder

how to read object field within json file

2017-03-23 Thread Selvam Raman
Hi, { "id": "test1", "source": { "F1": { "id": "4970", "eId": "F1", "description": "test1", }, "F2": { "id": "5070", "eId": "F2", "description": "test2", }, "F3": { "id": "5170", "eId": "F3", "description": "test3", },

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
Now I get the following error... error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. [ERROR] val

Re: Converting dataframe to dataset question

2017-03-23 Thread Yong Zhang
Not sure I understand this problem; why can't I reproduce it? scala> spark.version res22: String = 2.1.0 scala> case class Teamuser(teamid: String, userid: String, role: String) defined class Teamuser scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF df: org.apache.spark.sql.DataFrame =

Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
I realized my case class was inside the object. It should be defined outside the scope of the object. Thanks. On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande wrote: > Why userDS is Dataset[Any], instead of Dataset[Teamuser]? Appreciate your > help. Thanks > >
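
For anyone hitting the same thing, the fix in miniature (a sketch for spark-shell, where spark and its implicits are already available):

```scala
// Top level, NOT nested inside an object or class:
case class Teamuser(teamid: String, userid: String, role: String)

// With the case class at the top level, the encoder is found and the conversion is typed:
val df = Seq(Teamuser("t1", "u1", "r1")).toDF()
val userDS = df.as[Teamuser]   // Dataset[Teamuser], not Dataset[Any]
```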

[ANNOUNCE] Apache Gora 0.7 Release

2017-03-23 Thread lewis john mcgibbney
Hi Folks, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.7. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and

Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
Ok, Thanks for your answers On 3/22/17, 1:34 PM, "Cody Koeninger" wrote: If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once If you're talking about producing

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
I usually advocate a JIRA even for small stuff, but for a doc-only change like this it's OK to submit a PR directly with [MINOR] in the title. On Thu, 23 Mar 2017 at 06:55, chris snow wrote: > Thanks Nick. If this will help other users, I'll create a JIRA and > send a patch. > >

Application kill from UI does not propagate exception

2017-03-23 Thread Noorul Islam Kamal Malmiyoda
Hi all, I am trying to trap the UI kill event of a Spark application from the driver. Somehow the exception thrown is not propagated to the driver main program. See for example using spark-shell below. Is there a way to get hold of this event and shut down the driver program? Regards, Noorul

Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Aseem Bansal
I was reading http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest and found that this needs to be done in sklearn. Is that required in Spark?

[PySpark] - Binary File Partition

2017-03-23 Thread jjayadeep
Hi, I am using Spark 1.6.2. Is there a known bug where the number of partitions will always be 2 when minPartitions is not specified, as below? images = sc.binaryFiles("s3n://ACCESS_KEY:SECRET_KEY@imagefiles-gok/locofiles-data/") I was looking at the source
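
One thing worth trying is passing minPartitions explicitly; shown here in Scala, and the PySpark binaryFiles call accepts the same optional argument. The bucket path and partition count are placeholders:

```scala
// Ask for more input partitions up front instead of relying on the small default.
val images = sc.binaryFiles("s3n://some-bucket/locofiles-data/", minPartitions = 64)
println(images.getNumPartitions)

// If the resulting partitioning is still too coarse, repartition after loading.
val spread = images.repartition(64)
```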

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread chris snow
Thanks Nick. If this will help other users, I'll create a JIRA and send a patch. On 23 March 2017 at 13:49, Nick Pentreath wrote: > Yup, that is true and a reasonable clarification of the doc. > > On Thu, 23 Mar 2017 at 00:03 chris snow wrote: >>

Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
From the section on Pregel API in the GraphX programming guide: '... the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction /constrained to the topology of the graph/.' Does that answer your question? Did you read the programming guide? - Robin East Spark

Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
Yup, that is true and a reasonable clarification of the doc. On Thu, 23 Mar 2017 at 00:03 chris snow wrote: > The documentation for collaborative filtering is as follows: > > === > Scaling of the regularization parameter > > Since v1.1, we scale the regularization parameter

Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
GraphX is not synonymous with Pregel. To quote the GraphX programming guide: 'GraphX exposes a variant of the Pregel API.' There is no compute() function in GraphX - see the Pregel API section of the programming
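
For readers looking for the shape of that API, a sketch along the lines of the programming guide's single-source shortest paths example (assumes an existing graph with Double edge weights called graph and a placeholder source vertex id):

```scala
import org.apache.spark.graphx._

val sourceId: VertexId = 42L   // placeholder source vertex

// Start with distance 0.0 at the source and infinity everywhere else.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),            // vertex program
  triplet => {                                               // send messages along cheaper paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                                   // merge incoming messages
)
```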

Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
Not that I'm aware of. Where did you read that? - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

Re: Persist RDD doubt

2017-03-23 Thread nayan sharma
In case of task failures, does Spark clear the persisted RDD (StorageLevel.MEMORY_ONLY_SER) and recompute it when the task is attempted again from the beginning? Or will the cached RDD be appended to? How does Spark check whether the RDD has been cached and skip the caching step for a

[Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-23 Thread Behroz Sikander
Hello, Spark version: 1.6.2 Hadoop: 2.6.0 Cluster: All VMs are deployed on AWS. 1 Master (t2.large) 1 Secondary Master (t2.large) 5 Workers (m4.xlarge) Zookeeper (t2.large) Recently, 2 of our workers went down with an out-of-memory exception. > java.lang.OutOfMemoryError: GC overhead limit

Re: Persist RDD doubt

2017-03-23 Thread Artur R
I am not quite sure, but: - if the RDD is persisted in memory, then on task failure the executor JVM process fails too, so the memory is released - if the RDD is persisted on disk, then on task failure the Spark shutdown hook just wipes the temp files On Thu, Mar 23, 2017 at 10:55 AM, Jörn Franke wrote:

Re: Persist RDD doubt

2017-03-23 Thread Jörn Franke
What do you mean by 'clear'? What is the use case? > On 23 Mar 2017, at 10:16, nayan sharma wrote: > > Does Spark clear the persisted RDD in case the task fails? > > Regards, > > Nayan

Re: Best way to deal with skewed partition sizes

2017-03-23 Thread Gourav Sengupta
Hi, In the latest release of Spark I have seen significant improvements in case your data is in Parquet format, which I see it is. But since you are not using SparkSession and are using the older Spark APIs with sqlContext, there is a high chance that you are not using the Spark
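
A sketch of the newer entry point being referred to (the path and grouping column are placeholders based on the thread's description of data partitioned by customer id):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ParquetRead").getOrCreate()

// Read the partitioned Parquet data through the Spark 2.x SparkSession entry point.
val df = spark.read.parquet("s3a://some-bucket/some/path/")

df.groupBy("customer_id").count().show()
```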

Re: Best way to deal with skewed partition sizes

2017-03-23 Thread Gourav Sengupta
And on another note, is there any particular reason you are using s3a:// instead of s3://? Regards, Gourav On Wed, Mar 22, 2017 at 8:30 PM, Matt Deaver wrote: > For various reasons, our data set is partitioned in Spark by customer id > and saved to S3. When trying to

Persist RDD doubt

2017-03-23 Thread nayan sharma
Does Spark clear the persisted RDD in case the task fails? Regards, Nayan

Aggregated column name

2017-03-23 Thread Wen Pei Yu
Hi All, I found some Spark versions (Spark 1.4) return an upper-case aggregated column name, and some return lower case. The code below, df.groupBy(col("...")).agg(count("number")), may return: COUNT(number) -- Spark 1.4 count(number) -- Spark 1.6 Anyone know if there is a configuration parameter for

Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread chris snow
The documentation for collaborative filtering is as follows: === Scaling of the regularization parameter Since v1.1, we scale the regularization parameter lambda in solving each least squares problem by the number of ratings the user generated in updating user factors, or the number of ratings
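
Restating that scaling in symbols (a sketch using the usual ALS-WR notation, not text copied from the Spark docs): with user factors x_u, item factors y_i, n_u the number of ratings by user u and m_i the number of ratings of item i, the objective being solved is roughly

```latex
\min_{x_*, y_*} \sum_{(u,i)\,\text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \right)
```

so a single choice of lambda behaves comparably as the number of ratings grows.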

Mismatch in data type comparision results full data in Spark

2017-03-23 Thread santlal56
Hi, I am using the *where method* of a DataFrame to filter data. I am comparing an Integer field with String-typed data, and this comparison returns the full table's data. I have tested the same scenario with Hive and MySQL, but there this comparison does not give any result. *Scenario:* val sqlDf = df.where("f1 =
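
One way to avoid the surprise is to make the comparison types explicit (a sketch; the column name comes from the snippet above and the literal values are placeholders):

```scala
import org.apache.spark.sql.functions.col

// Compare the integer column against an integer literal, or cast explicitly,
// so no implicit string-to-integer cast is left to the engine.
val byInt  = df.where(col("f1") === 100)
val byCast = df.where(col("f1").cast("string") === "100")
```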

Re: Spark data frame map problem

2017-03-23 Thread Yan Facai
Could you give more details of your code? On Wed, Mar 22, 2017 at 2:40 AM, Shashank Mandil wrote: > Hi All, > > I have a spark data frame which has 992 rows inside it. > When I run a map on this data frame I expect that the map should work for > all the 992 rows. >