Re: Aggregated column name

2017-03-23 Thread Wen Pei Yu

Thanks, Kevin.

This works for aggregating one or two columns, but it does not work for this:

val expr = (Map("forCount" -> "count") ++ features.map((_ -> "mean")))
val averageDF = originalDF
  .withColumn("forCount", lit(0))
  .groupBy(col("..."))
  .agg(expr)
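
One possible way around this (a sketch only, not tested against this exact job; it
assumes "features" is a Seq[String] holding the numeric column names) is to build
the aggregations as aliased Column expressions instead of a Map[String, String],
so every output column name is explicit:

import org.apache.spark.sql.functions.{avg, col, count, lit}

// One aliased Column per aggregation, so the result column names are under our control.
val aggExprs =
  count(lit(1)).alias("rowCount") +:
  features.map(f => avg(col(f)).alias(s"${f}_mean"))

val averageDF = originalDF
  .groupBy(col("..."))                  // same grouping column as in the snippet above
  .agg(aggExprs.head, aggExprs.tail: _*)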

Yu Wenpei.



From:   Kevin Mellott 
To: Wen Pei Yu 
Cc: user 
Date:   03/24/2017 09:48 AM
Subject:Re: Aggregated column name



I'm not sure of the answer to your question; however, when performing
aggregates I find it useful to specify an alias for each column. That will
give you explicit control over the name of the resulting column.

In your example, that would look something like:

df.groupby(col("...")).agg(count("number").alias("ColumnNameCount"))

Hope that helps!
Kevin

On Thu, Mar 23, 2017 at 2:41 AM, Wen Pei Yu  wrote:
  Hi All

  I found some spark version(spark 1.4) return upper case aggregated
  column,  and some return low case.
  As below code,
  df.groupby(col("...")).agg(count("number"))
  may return

  COUNT(number)  -- spark 1.4
  count(number) - spark 1.6

  Anyone know if there is configure parameter for this, or which PR change
  this?

  Thank you very much.
  Yu Wenpei.

  - To
  unsubscribe e-mail: user-unsubscr...@spark.apache.org




Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
Ryan, you are right. That was the issue. It works now. Thanks.

On Thu, Mar 23, 2017 at 8:26 PM, Ryan  wrote:

> you should import either spark.implicits or sqlContext.implicits, not
> both. Otherwise the compiler will be confused about two implicit
> transformations
>
> following code works for me, spark version 2.1.0
>
> object Test {
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder
>   .master("local")
>   .appName(getClass.getSimpleName)
>   .getOrCreate()
> import spark.implicits._
> val df = Seq(TeamUser("t1", "u1", "r1")).toDF()
> df.printSchema()
> spark.close()
>   }
> }
>
> case class TeamUser(teamId: String, userId: String, role: String)
>
>
> On Fri, Mar 24, 2017 at 5:23 AM, shyla deshpande  > wrote:
>
>> I made the code even more simpler still getting the error
>>
>> error: value toDF is not a member of Seq[com.whil.batch.Teamuser]
>> [ERROR] val df = Seq(Teamuser("t1","u1","r1")).toDF()
>>
>> object Test {
>>   def main(args: Array[String]) {
>> val spark = SparkSession
>>   .builder
>>   .appName(getClass.getSimpleName)
>>   .getOrCreate()
>> import spark.implicits._
>> val sqlContext = spark.sqlContext
>> import sqlContext.implicits._
>> val df = Seq(Teamuser("t1","u1","r1")).toDF()
>> df.printSchema()
>>   }
>> }
>> case class Teamuser(teamid:String, userid:String, role:String)
>>
>>
>>
>>
>> On Thu, Mar 23, 2017 at 1:07 PM, Yong Zhang  wrote:
>>
>>> Not sure I understand this problem, why I cannot reproduce it?
>>>
>>>
>>> scala> spark.version
>>> res22: String = 2.1.0
>>>
>>> scala> case class Teamuser(teamid: String, userid: String, role: String)
>>> defined class Teamuser
>>>
>>> scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF
>>> df: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 1 
>>> more field]
>>>
>>> scala> df.show
>>> +------+------+-----+
>>> |teamid|userid| role|
>>> +------+------+-----+
>>> |    t1|    u1|role1|
>>> +------+------+-----+
>>>
>>> scala> df.createOrReplaceTempView("teamuser")
>>>
>>> scala> val newDF = spark.sql("select teamid, userid, role from teamuser")
>>> newDF: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 
>>> 1 more field]
>>>
>>> scala> val userDS: Dataset[Teamuser] = newDF.as[Teamuser]
>>> userDS: org.apache.spark.sql.Dataset[Teamuser] = [teamid: string, userid: 
>>> string ... 1 more field]
>>>
>>> scala> userDS.show
>>> +------+------+-----+
>>> |teamid|userid| role|
>>> +------+------+-----+
>>> |    t1|    u1|role1|
>>> +------+------+-----+
>>>
>>>
>>> scala> userDS.printSchema
>>> root
>>>  |-- teamid: string (nullable = true)
>>>  |-- userid: string (nullable = true)
>>>  |-- role: string (nullable = true)
>>>
>>>
>>> Am I missing anything?
>>>
>>>
>>> Yong
>>>
>>>
>>> --
>>> *From:* shyla deshpande 
>>> *Sent:* Thursday, March 23, 2017 3:49 PM
>>> *To:* user
>>> *Subject:* Re: Converting dataframe to dataset question
>>>
>>> I realized, my case class was inside the object. It should be defined
>>> outside the scope of the object. Thanks
>>>
>>> On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
 Why userDS is Dataset[Any], instead of Dataset[Teamuser]?  Appreciate your 
 help. Thanks

 val spark = SparkSession
   .builder
   .config("spark.cassandra.connection.host", cassandrahost)
   .appName(getClass.getSimpleName)
   .getOrCreate()

 import spark.implicits._
 val sqlContext = spark.sqlContext
 import sqlContext.implicits._

 case class Teamuser(teamid:String, userid:String, role:String)
 spark
   .read
   .format("org.apache.spark.sql.cassandra")
   .options(Map("keyspace" -> "test", "table" -> "teamuser"))
   .load
   .createOrReplaceTempView("teamuser")

 val userDF = spark.sql("SELECT teamid, userid, role FROM teamuser")

 userDF.show()

 val userDS:Dataset[Teamuser] = userDF.as[Teamuser]


>>>
>>
>


Re: How to load "kafka" as a data source

2017-03-23 Thread Deepu Raj
Please check that the tool versions are the same throughout.

Thanks
Deepu

On Fri, 24 Mar 2017 14:47:11 +1100, Gaurav1809   
wrote:

> Hi All,
>
> I am running a simple command on spark-shell - like this. It's a piece of
> structured streaming.
>
> val lines = (spark
>.readStream
>.format("kafka")
>.option("kafka.bootstrap.servers", "localhost:9092")
>.option("subscribe", "test")
>.load()
>.selectExpr("CAST(value AS STRING)")
>.as[String]
> )
>
> I also downloaded spark-streaming-kafka-0-10-assembly_2.11-2.1.0.jar and
> placed it in jars folder. re ran the spark-shell and executed above  
> command.
> but no luck. getting following error.
>
> java.lang.ClassNotFoundException: Failed to find data source: kafka.  
> Please
> find packages at http://spark.apache.org/third-party-projects.html
>
> Please suggest if I am missing anything here. I am running spark on  
> windows.
>
>
>
> --
> View this message in context:  
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-kafka-as-a-data-source-tp28534.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


-- 
Using Opera's mail client: http://www.opera.com/mail/




How to load "kafka" as a data source

2017-03-23 Thread Gaurav1809
Hi All,

I am running a simple command in spark-shell, like this. It's a piece of
Structured Streaming code.

val lines = (spark
   .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("subscribe", "test")
   .load()
   .selectExpr("CAST(value AS STRING)")
   .as[String]
)

I also downloaded spark-streaming-kafka-0-10-assembly_2.11-2.1.0.jar and
placed it in the jars folder, re-ran the spark-shell, and executed the above
command, but no luck; I am getting the following error.

java.lang.ClassNotFoundException: Failed to find data source: kafka. Please
find packages at http://spark.apache.org/third-party-projects.html

Please suggest if I am missing anything here. I am running Spark on Windows.
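
One thing worth checking: the Structured Streaming Kafka source is published as a
separate artifact (spark-sql-kafka-0-10), not as part of the
spark-streaming-kafka-0-10 assembly used by the old DStream API. A sketch, assuming
Spark 2.1.0 built for Scala 2.11:

// Start the shell with the SQL/Structured Streaming Kafka source on the classpath:
//   spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0

// Then, inside spark-shell (spark.implicits._ is already imported there):
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]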



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-load-kafka-as-a-data-source-tp28534.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Ryan
No, you don't need one-hot encoding. But since the feature column is a vector, and a
vector only accepts numbers, a StringIndexer is needed if your feature is a string.

Here's an example:
http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
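
For illustration, a minimal sketch of that approach (the column names here are
hypothetical, and the label is assumed to be numeric already):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Index the string category into a numeric column carrying categorical metadata.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// Assemble the indexed category and any numeric columns into one feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIndex", "numericFeature"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(32)  // should be at least the number of distinct categories

val pipeline = new Pipeline().setStages(Array(indexer, assembler, rf))
// val model = pipeline.fit(trainingDF)  // trainingDF is assumed to exist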

On Thu, Mar 23, 2017 at 10:34 PM, Aseem Bansal  wrote:

> I was reading
> http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest
> and found that this needs to be done in sklearn. Is that required in Spark?
>


Re: Converting dataframe to dataset question

2017-03-23 Thread Ryan
You should import either spark.implicits or sqlContext.implicits, not both;
otherwise the compiler will be confused by the two competing implicit conversions.

The following code works for me (Spark version 2.1.0):

object Test {
  def main(args: Array[String]) {
val spark = SparkSession
  .builder
  .master("local")
  .appName(getClass.getSimpleName)
  .getOrCreate()
import spark.implicits._
val df = Seq(TeamUser("t1", "u1", "r1")).toDF()
df.printSchema()
spark.close()
  }
}

case class TeamUser(teamId: String, userId: String, role: String)


On Fri, Mar 24, 2017 at 5:23 AM, shyla deshpande 
wrote:

> I made the code even more simpler still getting the error
>
> error: value toDF is not a member of Seq[com.whil.batch.Teamuser]
> [ERROR] val df = Seq(Teamuser("t1","u1","r1")).toDF()
>
> object Test {
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder
>   .appName(getClass.getSimpleName)
>   .getOrCreate()
> import spark.implicits._
> val sqlContext = spark.sqlContext
> import sqlContext.implicits._
> val df = Seq(Teamuser("t1","u1","r1")).toDF()
> df.printSchema()
>   }
> }
> case class Teamuser(teamid:String, userid:String, role:String)
>
>
>
>
> On Thu, Mar 23, 2017 at 1:07 PM, Yong Zhang  wrote:
>
>> Not sure I understand this problem, why I cannot reproduce it?
>>
>>
>> scala> spark.version
>> res22: String = 2.1.0
>>
>> scala> case class Teamuser(teamid: String, userid: String, role: String)
>> defined class Teamuser
>>
>> scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF
>> df: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 1 
>> more field]
>>
>> scala> df.show
>> +------+------+-----+
>> |teamid|userid| role|
>> +------+------+-----+
>> |    t1|    u1|role1|
>> +------+------+-----+
>>
>> scala> df.createOrReplaceTempView("teamuser")
>>
>> scala> val newDF = spark.sql("select teamid, userid, role from teamuser")
>> newDF: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 
>> 1 more field]
>>
>> scala> val userDS: Dataset[Teamuser] = newDF.as[Teamuser]
>> userDS: org.apache.spark.sql.Dataset[Teamuser] = [teamid: string, userid: 
>> string ... 1 more field]
>>
>> scala> userDS.show
>> +------+------+-----+
>> |teamid|userid| role|
>> +------+------+-----+
>> |    t1|    u1|role1|
>> +------+------+-----+
>>
>>
>> scala> userDS.printSchema
>> root
>>  |-- teamid: string (nullable = true)
>>  |-- userid: string (nullable = true)
>>  |-- role: string (nullable = true)
>>
>>
>> Am I missing anything?
>>
>>
>> Yong
>>
>>
>> --
>> *From:* shyla deshpande 
>> *Sent:* Thursday, March 23, 2017 3:49 PM
>> *To:* user
>> *Subject:* Re: Converting dataframe to dataset question
>>
>> I realized, my case class was inside the object. It should be defined
>> outside the scope of the object. Thanks
>>
>> On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Why userDS is Dataset[Any], instead of Dataset[Teamuser]?  Appreciate your 
>>> help. Thanks
>>>
>>> val spark = SparkSession
>>>   .builder
>>>   .config("spark.cassandra.connection.host", cassandrahost)
>>>   .appName(getClass.getSimpleName)
>>>   .getOrCreate()
>>>
>>> import spark.implicits._
>>> val sqlContext = spark.sqlContext
>>> import sqlContext.implicits._
>>>
>>> case class Teamuser(teamid:String, userid:String, role:String)
>>> spark
>>>   .read
>>>   .format("org.apache.spark.sql.cassandra")
>>>   .options(Map("keyspace" -> "test", "table" -> "teamuser"))
>>>   .load
>>>   .createOrReplaceTempView("teamuser")
>>>
>>> val userDF = spark.sql("SELECT teamid, userid, role FROM teamuser")
>>>
>>> userDF.show()
>>>
>>> val userDS:Dataset[Teamuser] = userDF.as[Teamuser]
>>>
>>>
>>
>


Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread shyla deshpande
Thanks a million Yong. Great help!!! It solved my problem.

On Thu, Mar 23, 2017 at 6:00 PM, Yong Zhang  wrote:

> Change:
>
> val arrayinput = input.getAs[Array[String]](0)
>
> to:
>
> val arrayinput = input.getAs[*Seq*[String]](0)
>
>
> Yong
>
>
> --
> *From:* shyla deshpande 
> *Sent:* Thursday, March 23, 2017 8:18 PM
> *To:* user
> *Subject:* Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!
>
> This is my input data. The UDAF needs to aggregate the goals for a team
> and return a map that  gives the count for every goal in the team.
> I am getting the following error
>
> java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef
> cannot be cast to [Ljava.lang.String;
> at com.whil.common.GoalAggregator.update(GoalAggregator.scala:27)
>
> +------+--------------+
> |teamid|goals         |
> +------+--------------+
> |t1    |[Goal1, Goal2]|
> |t1    |[Goal1, Goal3]|
> |t2    |[Goal1, Goal2]|
> |t3    |[Goal2, Goal3]|
> +------+--------------+
>
> root
>  |-- teamid: string (nullable = true)
>  |-- goals: array (nullable = true)
>  |    |-- element: string (containsNull = true)
>
> /Calling the UDAF//
>
> object TestUDAF {
>   def main(args: Array[String]): Unit = {
>
> val spark = SparkSession
>   .builder
>   .getOrCreate()
>
> val sc: SparkContext = spark.sparkContext
> val sqlContext = spark.sqlContext
>
> import sqlContext.implicits._
>
> val data = Seq(
>   ("t1", Seq("Goal1", "Goal2")),
>   ("t1", Seq("Goal1", "Goal3")),
>   ("t2", Seq("Goal1", "Goal2")),
>   ("t3", Seq("Goal2", "Goal3"))).toDF("teamid","goals")
>
> data.show(truncate = false)
> data.printSchema()
>
> import spark.implicits._
>
> val sumgoals = new GoalAggregator
> val result = data.groupBy("teamid").agg(sumgoals(col("goals")))
>
> result.show(truncate = false)
>
>   }
> }
>
> ///UDAF/
>
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> class GoalAggregator extends UserDefinedAggregateFunction{
>
>   override def inputSchema: org.apache.spark.sql.types.StructType =
>   StructType(StructField("value", ArrayType(StringType)) :: Nil)
>
>   override def bufferSchema: StructType = StructType(
>   StructField("combined", MapType(StringType,IntegerType)) :: Nil
>   )
>
>   override def dataType: DataType = MapType(StringType,IntegerType)
>
>   override def deterministic: Boolean = true
>
>   override def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer.update(0, Map[String, Integer]())
>   }
>
>   override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> val mapbuf = buffer.getAs[Map[String, Int]](0)
> val arrayinput = input.getAs[Array[String]](0)
> val result = mapbuf ++ arrayinput.map(goal => {
>   val cnt  = mapbuf.get(goal).getOrElse(0) + 1
>   goal -> cnt
> })
> buffer.update(0, result)
>   }
>
>   override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = 
> {
> val map1 = buffer1.getAs[Map[String, Int]](0)
> val map2 = buffer2.getAs[Map[String, Int]](0)
> val result = map1 ++ map2.map { case (k,v) =>
>   val cnt = map1.get(k).getOrElse(0) + 1
>   k -> cnt
> }
> buffer1.update(0, result)
>   }
>
>   override def evaluate(buffer: Row): Any = {
> buffer.getAs[Map[String, Int]](0)
>   }
> }
>
>
>
>


Re: Aggregated column name

2017-03-23 Thread Kevin Mellott
I'm not sure of the answer to your question; however, when performing
aggregates I find it useful to specify an *alias* for each column. That
will give you explicit control over the name of the resulting column.

In your example, that would look something like:

df.groupby(col("...")).agg(count("number").alias("ColumnNameCount"))

Hope that helps!
Kevin

On Thu, Mar 23, 2017 at 2:41 AM, Wen Pei Yu  wrote:

> Hi All
>
> I found some spark version(spark 1.4) return upper case aggregated
> column,  and some return low case.
> As below code,
> df.groupby(col("...")).agg(count("number"))
> may return
>
> COUNT(number)  -- spark 1.4
> count(number) - spark 1.6
>
> Anyone know if there is configure parameter for this, or which PR change
> this?
>
> Thank you very much.
> Yu Wenpei.
>
> - To
> unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: LDA in Spark

2017-03-23 Thread Joseph Bradley
Hi Mathieu,

I'm CCing the Spark user list since this will be of general interest to the
forum.  Unfortunately, there is not a way to begin LDA training with an
existing model currently.  Some MLlib models have been augmented to support
specifying an "initialModel" argument, but LDA does not have this yet.
Please feel free to make a feature request JIRA for it!

Thanks,
Joseph

On Thu, Mar 23, 2017 at 4:54 PM, Mathieu DESPRIEE 
wrote:

> Hello Joseph,
>
> I saw your contribution to online LDA in Spark (SPARK-5563). Please allow
> me a very quick question :
>
> I'm very much interested in training an LDA model incrementally with new
> batches of documents. This online algorithm seems to fit, but from what I
> understand of the current ml API, this is not possible to update a trained
> model with new documents.
> Is it ?
>
> Is there any way to get around the API and do that ?
>
> Thanks in advance for your insight.
>
> Mathieu
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread Yong Zhang
Change:

val arrayinput = input.getAs[Array[String]](0)

to:

val arrayinput = input.getAs[Seq[String]](0)


Yong
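
For completeness, here is a sketch of how update (and merge, which in the posted
version adds 1 instead of the other buffer's count) might look with that change
applied. These are method bodies for the GoalAggregator class in this thread, shown
as an illustration rather than a tested drop-in, and they assume each goal appears
at most once per input array, as in the sample data:

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val mapbuf = buffer.getAs[Map[String, Int]](0)
    // Spark hands an array column to a UDAF as a Seq (WrappedArray), not a Java array.
    val goals = input.getAs[Seq[String]](0)
    val result = mapbuf ++ goals.map(goal => goal -> (mapbuf.getOrElse(goal, 0) + 1))
    buffer.update(0, result)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val map1 = buffer1.getAs[Map[String, Int]](0)
    val map2 = buffer2.getAs[Map[String, Int]](0)
    // Add the counts from both partial buffers rather than incrementing by one.
    val result = map1 ++ map2.map { case (k, v) => k -> (map1.getOrElse(k, 0) + v) }
    buffer1.update(0, result)
  }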



From: shyla deshpande 
Sent: Thursday, March 23, 2017 8:18 PM
To: user
Subject: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

This is my input data. The UDAF needs to aggregate the goals for a team and 
return a map that  gives the count for every goal in the team.
I am getting the following error

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef 
cannot be cast to [Ljava.lang.String;
at com.whil.common.GoalAggregator.update(GoalAggregator.scala:27)

+------+--------------+
|teamid|goals         |
+------+--------------+
|t1    |[Goal1, Goal2]|
|t1    |[Goal1, Goal3]|
|t2    |[Goal1, Goal2]|
|t3    |[Goal2, Goal3]|
+------+--------------+

root
 |-- teamid: string (nullable = true)
 |-- goals: array (nullable = true)
 |    |-- element: string (containsNull = true)

/Calling the UDAF//

object TestUDAF {
  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder
  .getOrCreate()

val sc: SparkContext = spark.sparkContext
val sqlContext = spark.sqlContext

import sqlContext.implicits._

val data = Seq(
  ("t1", Seq("Goal1", "Goal2")),
  ("t1", Seq("Goal1", "Goal3")),
  ("t2", Seq("Goal1", "Goal2")),
  ("t3", Seq("Goal2", "Goal3"))).toDF("teamid","goals")

data.show(truncate = false)
data.printSchema()

import spark.implicits._

val sumgoals = new GoalAggregator
val result = data.groupBy("teamid").agg(sumgoals(col("goals")))

result.show(truncate = false)

  }
}

///UDAF/

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class GoalAggregator extends UserDefinedAggregateFunction{

  override def inputSchema: org.apache.spark.sql.types.StructType =
  StructType(StructField("value", ArrayType(StringType)) :: Nil)

  override def bufferSchema: StructType = StructType(
  StructField("combined", MapType(StringType,IntegerType)) :: Nil
  )

  override def dataType: DataType = MapType(StringType,IntegerType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0, Map[String, Integer]())
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val mapbuf = buffer.getAs[Map[String, Int]](0)
val arrayinput = input.getAs[Array[String]](0)
val result = mapbuf ++ arrayinput.map(goal => {
  val cnt  = mapbuf.get(goal).getOrElse(0) + 1
  goal -> cnt
})
buffer.update(0, result)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val map1 = buffer1.getAs[Map[String, Int]](0)
val map2 = buffer2.getAs[Map[String, Int]](0)
val result = map1 ++ map2.map { case (k,v) =>
  val cnt = map1.get(k).getOrElse(0) + 1
  k -> cnt
}
buffer1.update(0, result)
  }

  override def evaluate(buffer: Row): Any = {
buffer.getAs[Map[String, Int]](0)
  }
}




Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread shyla deshpande
This is my input data. The UDAF needs to aggregate the goals for a team and
return a map that  gives the count for every goal in the team.
I am getting the following error

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef
cannot be cast to [Ljava.lang.String;
at com.whil.common.GoalAggregator.update(GoalAggregator.scala:27)

+------+--------------+
|teamid|goals         |
+------+--------------+
|t1    |[Goal1, Goal2]|
|t1    |[Goal1, Goal3]|
|t2    |[Goal1, Goal2]|
|t3    |[Goal2, Goal3]|
+------+--------------+

root
 |-- teamid: string (nullable = true)
 |-- goals: array (nullable = true)
 |    |-- element: string (containsNull = true)

/Calling the UDAF//

object TestUDAF {
  def main(args: Array[String]): Unit = {

val spark = SparkSession
  .builder
  .getOrCreate()

val sc: SparkContext = spark.sparkContext
val sqlContext = spark.sqlContext

import sqlContext.implicits._

val data = Seq(
  ("t1", Seq("Goal1", "Goal2")),
  ("t1", Seq("Goal1", "Goal3")),
  ("t2", Seq("Goal1", "Goal2")),
  ("t3", Seq("Goal2", "Goal3"))).toDF("teamid","goals")

data.show(truncate = false)
data.printSchema()

import spark.implicits._

val sumgoals = new GoalAggregator
val result = data.groupBy("teamid").agg(sumgoals(col("goals")))

result.show(truncate = false)

  }
}

///UDAF/

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

class GoalAggregator extends UserDefinedAggregateFunction{

  override def inputSchema: org.apache.spark.sql.types.StructType =
  StructType(StructField("value", ArrayType(StringType)) :: Nil)

  override def bufferSchema: StructType = StructType(
  StructField("combined", MapType(StringType,IntegerType)) :: Nil
  )

  override def dataType: DataType = MapType(StringType,IntegerType)

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0, Map[String, Integer]())
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val mapbuf = buffer.getAs[Map[String, Int]](0)
val arrayinput = input.getAs[Array[String]](0)
val result = mapbuf ++ arrayinput.map(goal => {
  val cnt  = mapbuf.get(goal).getOrElse(0) + 1
  goal -> cnt
})
buffer.update(0, result)
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val map1 = buffer1.getAs[Map[String, Int]](0)
val map2 = buffer2.getAs[Map[String, Int]](0)
val result = map1 ++ map2.map { case (k,v) =>
  val cnt = map1.get(k).getOrElse(0) + 1
  k -> cnt
}
buffer1.update(0, result)
  }

  override def evaluate(buffer: Row): Any = {
buffer.getAs[Map[String, Int]](0)
  }
}


Re: Persist RDD doubt

2017-03-23 Thread sjayatheertha


Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will 
automatically be recomputed using the transformations that originally created 
it.
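
As a small, concrete illustration of that behaviour (the input path and column
logic below are hypothetical):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///path/to/data")          // hypothetical input path
  .map(_.split(",")(0))
  .persist(StorageLevel.MEMORY_ONLY_SER)

rdd.count()      // first action computes and caches the partitions
rdd.count()      // served from the cache; any lost partitions are recomputed from lineage
rdd.unpersist()  // explicitly release the cached blocks when finished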




> On Mar 23, 2017, at 4:11 AM, nayan sharma  wrote:
> 
> In case of task failures,does spark clear the persisted RDD 
> (StorageLevel.MEMORY_ONLY_SER) and recompute them again when the task is 
> attempted to start from beginning. Or will the cached RDD be appended.
> 
> How does spark checks whether the RDD has been cached and skips the caching 
> step for a particular task.
> 
>> On 23-Mar-2017, at 3:36 PM, Artur R  wrote:
>> 
>> I am not pretty sure, but:
>>  - if RDD persisted in memory then on task fail executor JVM process fails 
>> too, so the memory is released
>>  - if RDD persisted on disk then on task fail Spark shutdown hook just wipes 
>> temp files
>> 
>>> On Thu, Mar 23, 2017 at 10:55 AM, Jörn Franke  wrote:
>>> What do you mean by clear ? What is the use case?
>>> 
 On 23 Mar 2017, at 10:16, nayan sharma  wrote:
 
 Does Spark clears the persisted RDD in case if the task fails ?
 
 Regards,
 
 Nayan
>> 
> 


Re: how to read object field within json file

2017-03-23 Thread Yong Zhang
That's why your "source" should be defined as an Array[Struct] type (which makes
sense in this case, since it has an undetermined length); then you could explode it
and get the description easily.

As it stands, you need to write your own UDF; maybe that can do what you want.

Yong
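
An alternative that avoids a UDF when you can work from the inferred schema (a
sketch only; it assumes df is the loaded DataFrame and keeps just the F* structs
that actually expose a description field):

import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.sql.types.StructType

// Enumerate the struct fields under "source" and collect their description columns.
val sourceType = df.schema("source").dataType.asInstanceOf[StructType]
val descCols = sourceType.fields.collect {
  case f if f.dataType.isInstanceOf[StructType] &&
            f.dataType.asInstanceOf[StructType].fieldNames.contains("description") =>
    col(s"source.${f.name}.description")
}
val withDescriptions = df.withColumn("descriptions", array(descCols: _*))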


From: Selvam Raman 
Sent: Thursday, March 23, 2017 5:03 PM
To: user
Subject: how to read object field within json file

Hi,

{
"id": "test1",
"source": {
"F1": {
  "id": "4970",
  "eId": "F1",
  "description": "test1",
},
"F2": {
  "id": "5070",
  "eId": "F2",
  "description": "test2",
},
"F3": {
  "id": "5170",
  "eId": "F3",
  "description": "test3",
},
"F4":{}
  etc..
  "F999":{}
}

I am having bzip json files like above format.
some json row contains two objects within source(like F1 and F2), sometime 
five(F1,F2,F3,F4,F5),etc. So the final schema will contains combination of all 
objects for the source field.

Now, every row will contain n number of objects but only some contains valid 
records.
how can i retreive the value of "description" in "source" field.

source.F1.description - returns the result but how can i get all description 
result for every row..(something like this "source.*.description").

--
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
I made the code even more simpler still getting the error

error: value toDF is not a member of Seq[com.whil.batch.Teamuser]
[ERROR] val df = Seq(Teamuser("t1","u1","r1")).toDF()

object Test {
  def main(args: Array[String]) {
val spark = SparkSession
  .builder
  .appName(getClass.getSimpleName)
  .getOrCreate()
import spark.implicits._
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val df = Seq(Teamuser("t1","u1","r1")).toDF()
df.printSchema()
  }
}
case class Teamuser(teamid:String, userid:String, role:String)




On Thu, Mar 23, 2017 at 1:07 PM, Yong Zhang  wrote:

> Not sure I understand this problem, why I cannot reproduce it?
>
>
> scala> spark.version
> res22: String = 2.1.0
>
> scala> case class Teamuser(teamid: String, userid: String, role: String)
> defined class Teamuser
>
> scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF
> df: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 1 
> more field]
>
> scala> df.show
> +------+------+-----+
> |teamid|userid| role|
> +------+------+-----+
> |    t1|    u1|role1|
> +------+------+-----+
>
> scala> df.createOrReplaceTempView("teamuser")
>
> scala> val newDF = spark.sql("select teamid, userid, role from teamuser")
> newDF: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 1 
> more field]
>
> scala> val userDS: Dataset[Teamuser] = newDF.as[Teamuser]
> userDS: org.apache.spark.sql.Dataset[Teamuser] = [teamid: string, userid: 
> string ... 1 more field]
>
> scala> userDS.show
> +------+------+-----+
> |teamid|userid| role|
> +------+------+-----+
> |    t1|    u1|role1|
> +------+------+-----+
>
>
> scala> userDS.printSchema
> root
>  |-- teamid: string (nullable = true)
>  |-- userid: string (nullable = true)
>  |-- role: string (nullable = true)
>
>
> Am I missing anything?
>
>
> Yong
>
>
> --
> *From:* shyla deshpande 
> *Sent:* Thursday, March 23, 2017 3:49 PM
> *To:* user
> *Subject:* Re: Converting dataframe to dataset question
>
> I realized, my case class was inside the object. It should be defined
> outside the scope of the object. Thanks
>
> On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande  > wrote:
>
>> Why userDS is Dataset[Any], instead of Dataset[Teamuser]?  Appreciate your 
>> help. Thanks
>>
>> val spark = SparkSession
>>   .builder
>>   .config("spark.cassandra.connection.host", cassandrahost)
>>   .appName(getClass.getSimpleName)
>>   .getOrCreate()
>>
>> import spark.implicits._
>> val sqlContext = spark.sqlContext
>> import sqlContext.implicits._
>>
>> case class Teamuser(teamid:String, userid:String, role:String)
>> spark
>>   .read
>>   .format("org.apache.spark.sql.cassandra")
>>   .options(Map("keyspace" -> "test", "table" -> "teamuser"))
>>   .load
>>   .createOrReplaceTempView("teamuser")
>>
>> val userDF = spark.sql("SELECT teamid, userid, role FROM teamuser")
>>
>> userDF.show()
>>
>> val userDS:Dataset[Teamuser] = userDF.as[Teamuser]
>>
>>
>


how to read object field within json file

2017-03-23 Thread Selvam Raman
Hi,

{
"id": "test1",
"source": {
"F1": {
  "id": "4970",
  "eId": "F1",
  "description": "test1",
},
"F2": {
  "id": "5070",
  "eId": "F2",
  "description": "test2",
},
"F3": {
  "id": "5170",
  "eId": "F3",
  "description": "test3",
},
"F4":{}
  etc..
  "F999":{}
}

I have bzip JSON files in the above format.
Some JSON rows contain two objects within source (like F1 and F2), sometimes
five (F1, F2, F3, F4, F5), etc. So the final schema will contain the combination of
all objects for the source field.

Now, every row will contain some number of objects, but only some contain
valid records.
How can I retrieve the value of "description" in the "source" field?

source.F1.description returns the result, but how can I get all of the
description values for every row (something like
"source.*.description")?

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
now I get a run time error...

error: Unable to find encoder for type stored in a Dataset.  Primitive
types (Int, String, etc) and Product types (case classes) are supported by
importing spark.implicits._  Support for serializing other types will be
added in future releases.
[ERROR] val userDS:Dataset[Teamuser] = userDF.as[Teamuser]

On Thu, Mar 23, 2017 at 12:49 PM, shyla deshpande 
wrote:

> I realized, my case class was inside the object. It should be defined
> outside the scope of the object. Thanks
>
> On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande  > wrote:
>
>> Why userDS is Dataset[Any], instead of Dataset[Teamuser]?  Appreciate your 
>> help. Thanks
>>
>> val spark = SparkSession
>>   .builder
>>   .config("spark.cassandra.connection.host", cassandrahost)
>>   .appName(getClass.getSimpleName)
>>   .getOrCreate()
>>
>> import spark.implicits._
>> val sqlContext = spark.sqlContext
>> import sqlContext.implicits._
>>
>> case class Teamuser(teamid:String, userid:String, role:String)
>> spark
>>   .read
>>   .format("org.apache.spark.sql.cassandra")
>>   .options(Map("keyspace" -> "test", "table" -> "teamuser"))
>>   .load
>>   .createOrReplaceTempView("teamuser")
>>
>> val userDF = spark.sql("SELECT teamid, userid, role FROM teamuser")
>>
>> userDF.show()
>>
>> val userDS:Dataset[Teamuser] = userDF.as[Teamuser]
>>
>>
>


Re: Converting dataframe to dataset question

2017-03-23 Thread Yong Zhang
Not sure I understand this problem; why can't I reproduce it?


scala> spark.version
res22: String = 2.1.0

scala> case class Teamuser(teamid: String, userid: String, role: String)
defined class Teamuser

scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF
df: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 1 more 
field]

scala> df.show
+------+------+-----+
|teamid|userid| role|
+------+------+-----+
|    t1|    u1|role1|
+------+------+-----+

scala> df.createOrReplaceTempView("teamuser")

scala> val newDF = spark.sql("select teamid, userid, role from teamuser")
newDF: org.apache.spark.sql.DataFrame = [teamid: string, userid: string ... 1 
more field]

scala> val userDS: Dataset[Teamuser] = newDF.as[Teamuser]
userDS: org.apache.spark.sql.Dataset[Teamuser] = [teamid: string, userid: 
string ... 1 more field]

scala> userDS.show
+------+------+-----+
|teamid|userid| role|
+------+------+-----+
|    t1|    u1|role1|
+------+------+-----+


scala> userDS.printSchema
root
 |-- teamid: string (nullable = true)
 |-- userid: string (nullable = true)
 |-- role: string (nullable = true)


Am I missing anything?


Yong



From: shyla deshpande 
Sent: Thursday, March 23, 2017 3:49 PM
To: user
Subject: Re: Converting dataframe to dataset question

I realized, my case class was inside the object. It should be defined outside 
the scope of the object. Thanks

On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande 
> wrote:

Why userDS is Dataset[Any], instead of Dataset[Teamuser]?  Appreciate your 
help. Thanks

val spark = SparkSession
  .builder
  .config("spark.cassandra.connection.host", cassandrahost)
  .appName(getClass.getSimpleName)
  .getOrCreate()

import spark.implicits._
val sqlContext = spark.sqlContext
import sqlContext.implicits._

case class Teamuser(teamid:String, userid:String, role:String)
spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "teamuser"))
  .load
  .createOrReplaceTempView("teamuser")

val userDF = spark.sql("SELECT teamid, userid, role FROM teamuser")

userDF.show()

val userDS:Dataset[Teamuser] = userDF.as[Teamuser]




Re: Converting dataframe to dataset question

2017-03-23 Thread shyla deshpande
I realized, my case class was inside the object. It should be defined
outside the scope of the object. Thanks

On Wed, Mar 22, 2017 at 6:07 PM, shyla deshpande 
wrote:

> Why userDS is Dataset[Any], instead of Dataset[Teamuser]?  Appreciate your 
> help. Thanks
>
> val spark = SparkSession
>   .builder
>   .config("spark.cassandra.connection.host", cassandrahost)
>   .appName(getClass.getSimpleName)
>   .getOrCreate()
>
> import spark.implicits._
> val sqlContext = spark.sqlContext
> import sqlContext.implicits._
>
> case class Teamuser(teamid:String, userid:String, role:String)
> spark
>   .read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("keyspace" -> "test", "table" -> "teamuser"))
>   .load
>   .createOrReplaceTempView("teamuser")
>
> val userDF = spark.sql("SELECT teamid, userid, role FROM teamuser")
>
> userDF.show()
>
> val userDS:Dataset[Teamuser] = userDF.as[Teamuser]
>
>


[ANNOUNCE] Apache Gora 0.7 Release

2017-03-23 Thread lewis john mcgibbney
Hi Folks,

The Apache Gora team are pleased to announce the immediate availability of
Apache Gora 0.7.
The Apache Gora open source framework provides an in-memory data model and
persistence for big data. Gora supports persisting to column stores, key
value stores, document stores and RDBMSs, and analyzing the data with
extensive Apache Hadoop™ MapReduce support.

The Gora DOAP can be found at http://gora.apache.org/current/doap_Gora.rdf

This release addresses 80 issues; for a breakdown please see the release
report. Drop by our mailing lists and ask
questions for information on any of the above.

Gora 0.7 provides support for the following projects

   - Apache Avro  1.8.1
   - Apache Hadoop  2.5.2
   - Apache HBase  1.2.3
   - Apache Cassandra  2.0.2
   - Apache Solr  5.5.1
   - MongoDB  (driver) 3.4.2
   - Apache Accumulo  1.7.1
   - Apache Spark  1.4.1
   - Apache CouchDB  1.4.2 (test containers
    1.1.0)
   - Amazon DynamoDB  (driver) 1.10.55
   - Infinispan  7.2.5.Final
   - JCache  1.0.0 with Hazelcast
    3.6.4 support.

Gora is released both as source code, which can be downloaded from our
downloads page, and as Maven artifacts, which can be found on Maven Central.
Thanks


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
Ok,
Thanks for your answers

On 3/22/17, 1:34 PM, "Cody Koeninger"  wrote:

If you're talking about reading the same message multiple times in a
failure situation, see

https://github.com/koeninger/kafka-exactly-once

If you're talking about producing the same message multiple times in a
failure situation, keep an eye on


https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

If you're talking about producers just misbehaving and sending
different copies of what is essentially the same message from a domain
perspective, you have to dedupe that with your own logic.

On Wed, Mar 22, 2017 at 2:52 PM, Matt Deaver  wrote:
> You have to handle de-duplication upstream or downstream. It might
> technically be possible to handle this in Spark but you'll probably have a
> better time handling duplicates in the service that reads from Kafka.
>
> On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart 
> wrote:
>>
>> Hi,
>> we are trying to build a spark streaming solution that subscribe and push
>> to kafka.
>>
>> But we are running into the problem of duplicates events.
>>
>> Right now, I am doing a “forEachRdd” and loop over the message of each
>> partition and send those message to kafka.
>>
>>
>>
>> Is there any good way of solving that issue?
>>
>>
>>
>> thanks
>
>
>
>
> --
> Regards,
>
> Matt
> Data Engineer
> https://www.linkedin.com/in/mdeaver
> http://mattdeav.pythonanywhere.com/




Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
I usually advocate a JIRA even for small stuff, but for a doc-only change like
this it's OK to submit a PR directly with [MINOR] in the title.


On Thu, 23 Mar 2017 at 06:55, chris snow  wrote:

> Thanks Nick.  If this will help other users, I'll create a JIRA and
> send a patch.
>
> On 23 March 2017 at 13:49, Nick Pentreath 
> wrote:
> > Yup, that is true and a reasonable clarification of the doc.
> >
> > On Thu, 23 Mar 2017 at 00:03 chris snow  wrote:
> >>
> >> The documentation for collaborative filtering is as follows:
> >>
> >> ===
> >> Scaling of the regularization parameter
> >>
> >> Since v1.1, we scale the regularization parameter lambda in solving
> >> each least squares problem by the number of ratings the user generated
> >> in updating user factors, or the number of ratings the product
> >> received in updating product factors.
> >> ===
> >>
> >> I find this description confusing, probably because I lack a detailed
> >> understanding of ALS.   The wording suggest that the number of ratings
> >> change ("generated", "received") during solving the least squares.
> >>
> >> This is how I think I should be interpreting the description:
> >>
> >> ===
> >> Since v1.1, we scale the regularization parameter lambda when solving
> >> each least squares problem.  When updating the user factors, we scale
> >> the regularization parameter by the total number of ratings from the
> >> user.  Similarly, when updating the product factors, we scale the
> >> regularization parameter by the total number of ratings for the
> >> product.
> >> ===
> >>
> >> Have I understood this correctly?
> >>
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Application kill from UI does not propagate exception

2017-03-23 Thread Noorul Islam Kamal Malmiyoda
Hi all,

I am trying to trap the UI kill event of a Spark application from the driver.
Somehow the exception thrown is not propagated to the driver main
program; see the spark-shell example below.

Is there a way to get hold of this event and shut down the driver program?

Regards,
Noorul
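
One avenue that might be worth trying (a sketch only; I have not verified that this
listener actually fires before the driver is torn down when the application is
killed from the UI):

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Register a listener on the existing SparkContext (sc in spark-shell).
sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    println(s"Application ended at ${applicationEnd.time}; running driver-side shutdown")
    // trigger whatever cleanup or exit logic the driver program needs here
  }
})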


spark@spark1:~/spark-2.1.0/sbin$ spark-shell --master spark://10.29.83.162:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
17/03/23 15:16:47 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
17/03/23 15:16:53 WARN ObjectStore: Failed to get database
global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.29.83.162:4040
Spark context available as 'sc' (master = spark://10.29.83.162:7077,
app id = app-20170323151648-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 17/03/23 15:17:28 ERROR StandaloneSchedulerBackend: Application
has been killed. Reason: Master removed our application: KILLED
17/03/23 15:17:28 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster
scheduler: Master removed our application: KILLED
at 
org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:459)
at 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:139)
at 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:254)
at 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:168)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@25b8f9d2

scala>




Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Aseem Bansal
I was reading
http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest
and found that this needs to be done in sklearn. Is that required in Spark?


[PySpark] - Binary File Partition

2017-03-23 Thread jjayadeep
Hi,

I am using Spark 1.6.2; is there a known bug where the number of partitions
will always be 2 when minPartitions is not specified, as below?

images =
sc.binaryFiles("s3n://AKIAIOJYJILW24BQSIEA:txGkP6YcOHTjBNHPLFbbgmxPfkVQoyUktsVCVKaf@imagefiles-gok/locofiles-data/")

I was looking at the source code for PortableDataStream.scala, which I
believe is used when we invoke the binary files interface, and I see the
code below:

  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions:
Int) {
val defaultMaxSplitBytes =
sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
val defaultParallelism = sc.defaultParallelism
val files = listStatus(context).asScala
val totalBytes = files.filterNot(_.isDirectory).map(_.getLen +
openCostInBytes).sum
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitSize = Math.min(defaultMaxSplitBytes,
Math.max(openCostInBytes, bytesPerCore))
super.setMaxSplitSize(maxSplitSize)
  }
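
To make the arithmetic concrete, the sketch below just replays that calculation with
illustrative numbers (128 MB and 4 MB are the documented defaults for those two
settings; the file count, sizes, and parallelism are made up):

// Illustrative arithmetic only -- not code from Spark itself.
val defaultMaxSplitBytes = 128L * 1024 * 1024        // spark.files.maxPartitionBytes default
val openCostInBytes      = 4L * 1024 * 1024          // spark.files.openCostInBytes default
val defaultParallelism   = 8                          // hypothetical cluster setting
val numFiles             = 20                         // hypothetical number of binary files
val dataBytes            = 100L * 1024 * 1024         // hypothetical total file size (~100 MB)

val totalBytes   = dataBytes + numFiles * openCostInBytes
val bytesPerCore = totalBytes / defaultParallelism
val maxSplitSize = math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
// maxSplitSize comes out around 22 MB here; note that the minPartitions argument
// itself never enters this formula.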

Does it mean that minPartitions will no longer be used in the partition
determination calculation?

Kindly advise.

Thanks,
Jayadeep



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Binary-File-Partition-tp28531.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread chris snow
Thanks Nick.  If this will help other users, I'll create a JIRA and
send a patch.

On 23 March 2017 at 13:49, Nick Pentreath  wrote:
> Yup, that is true and a reasonable clarification of the doc.
>
> On Thu, 23 Mar 2017 at 00:03 chris snow  wrote:
>>
>> The documentation for collaborative filtering is as follows:
>>
>> ===
>> Scaling of the regularization parameter
>>
>> Since v1.1, we scale the regularization parameter lambda in solving
>> each least squares problem by the number of ratings the user generated
>> in updating user factors, or the number of ratings the product
>> received in updating product factors.
>> ===
>>
>> I find this description confusing, probably because I lack a detailed
>> understanding of ALS.   The wording suggest that the number of ratings
>> change ("generated", "received") during solving the least squares.
>>
>> This is how I think I should be interpreting the description:
>>
>> ===
>> Since v1.1, we scale the regularization parameter lambda when solving
>> each least squares problem.  When updating the user factors, we scale
>> the regularization parameter by the total number of ratings from the
>> user.  Similarly, when updating the product factors, we scale the
>> regularization parameter by the total number of ratings for the
>> product.
>> ===
>>
>> Have I understood this correctly?
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>




Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
From the section on the Pregel API in the GraphX programming guide: '... the
Pregel operator in GraphX is a bulk-synchronous parallel messaging
abstraction /constrained to the topology of the graph/.' Does that answer
your question? Did you read the programming guide?



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread Nick Pentreath
Yup, that is true and a reasonable clarification of the doc.

On Thu, 23 Mar 2017 at 00:03 chris snow  wrote:

> The documentation for collaborative filtering is as follows:
>
> ===
> Scaling of the regularization parameter
>
> Since v1.1, we scale the regularization parameter lambda in solving
> each least squares problem by the number of ratings the user generated
> in updating user factors, or the number of ratings the product
> received in updating product factors.
> ===
>
> I find this description confusing, probably because I lack a detailed
> understanding of ALS.   The wording suggest that the number of ratings
> change ("generated", "received") during solving the least squares.
>
> This is how I think I should be interpreting the description:
>
> ===
> Since v1.1, we scale the regularization parameter lambda when solving
> each least squares problem.  When updating the user factors, we scale
> the regularization parameter by the total number of ratings from the
> user.  Similarly, when updating the product factors, we scale the
> regularization parameter by the total number of ratings for the
> product.
> ===
>
> Have I understood this correctly?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
GraphX is not synonymous with Pregel. To quote the GraphX programming guide:
'GraphX exposes a variant of the Pregel API.' There is no compute()
function in GraphX - see the Pregel API section of the programming guide for
details on how GraphX implements a Pregel-like API.



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
Not that I'm aware of. Where did you read that?



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pregel-API-add-vertices-and-edges-tp28519p28523.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Persist RDD doubt

2017-03-23 Thread nayan sharma
In case of task failures, does Spark clear the persisted RDD
(StorageLevel.MEMORY_ONLY_SER) and recompute it when the task is
re-attempted from the beginning, or will the cached RDD be appended to?

How does Spark check whether the RDD has been cached and skip the caching
step for a particular task?

> On 23-Mar-2017, at 3:36 PM, Artur R  wrote:
> 
> I am not pretty sure, but:
>  - if RDD persisted in memory then on task fail executor JVM process fails 
> too, so the memory is released
>  - if RDD persisted on disk then on task fail Spark shutdown hook just wipes 
> temp files
> 
> On Thu, Mar 23, 2017 at 10:55 AM, Jörn Franke  > wrote:
> What do you mean by clear ? What is the use case?
> 
> On 23 Mar 2017, at 10:16, nayan sharma  > wrote:
> 
>> Does Spark clears the persisted RDD in case if the task fails ?
>> 
>> Regards,
>> 
>> Nayan
> 



[Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-23 Thread Behroz Sikander
Hello,
Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, two of our workers went down with an out-of-memory exception.

> java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)


Both of these worker processes were in a hung state. We restarted them to
bring them back to a normal state.

Here is the complete exception
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91

Master's spark-default.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d

Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason for our workers crashing with an OutOfMemoryError?
How can we avoid that in the future?

Regards,
Behroz


Re: Persist RDD doubt

2017-03-23 Thread Artur R
I am not quite sure, but:
 - if RDD persisted in memory then on task fail executor JVM process fails
too, so the memory is released
 - if RDD persisted on disk then on task fail Spark shutdown hook just
wipes temp files

On Thu, Mar 23, 2017 at 10:55 AM, Jörn Franke  wrote:

> What do you mean by clear ? What is the use case?
>
> On 23 Mar 2017, at 10:16, nayan sharma  wrote:
>
> Does Spark clears the persisted RDD in case if the task fails ?
>
> Regards,
> Nayan
>
>


Re: Persist RDD doubt

2017-03-23 Thread Jörn Franke
What do you mean by clear ? What is the use case?

> On 23 Mar 2017, at 10:16, nayan sharma  wrote:
> 
> Does Spark clears the persisted RDD in case if the task fails ?
> 
> Regards,
> 
> Nayan


Re: Best way to deal with skewed partition sizes

2017-03-23 Thread Gourav Sengupta
Hi,

In the latest release of Spark I have seen significant improvements when the
data is in Parquet format, which I see yours is.

But since you are not using SparkSession and are instead using the older
sqlContext API, there is a high chance that you are not benefiting from those
improvements at all.

Is there any particular reason why you would not prefer using SparkSession?


Regards,
Gourav

On Wed, Mar 22, 2017 at 8:30 PM, Matt Deaver  wrote:

> For various reasons, our data set is partitioned in Spark by customer id
> and saved to S3. When trying to read this data, however, the larger
> partitions make it difficult to parallelize jobs. For example, out of a
> couple thousand companies, some have <10 MB data while some have >10GB.
> This is the code I'm using in a Zeppelin notebook and it takes a very long
> time to read in (2+ hours on a ~200 GB dataset from S3):
>
> df1 = sqlContext.read.parquet("s3a://[bucket1]/[prefix1]/")
> df2 = sqlContext.read.parquet("s3a://[bucket2]/[prefix2]/")
>
> # generate a bunch of derived columns here for df1
> df1 = df1.withColumn('derivedcol1', df1.source_col)
>
>
> # limit output columns for later union
> df1 = df1.select(
> [limited set of columns]
> )
>
> # generate a bunch of derived columns here for df2
> df2 = df2.withColumn('derivedcol1', df2.source_col)
>
> # limit output columns for later union
> df2 = df2.select(
> [limited set of columns]
> )
>
> print(df1.rdd.getNumPartitions())
> print(df2.rdd.getNumPartitions())
>
> merge_df = df1.unionAll(df2)
> merge_df.repartition(300)
>
> merge_df.registerTempTable("union_table")
> sqlContext.cacheTable("union_table")
> sqlContext.sql("select count(*) from union_table").collect()
>
> Any suggestions on making this faster?
>


Re: Best way to deal with skewed partition sizes

2017-03-23 Thread Gourav Sengupta
And on another note, is there any particular reason why you are using s3a://
instead of s3://?


Regards,
Gourav

On Wed, Mar 22, 2017 at 8:30 PM, Matt Deaver  wrote:

> For various reasons, our data set is partitioned in Spark by customer id
> and saved to S3. When trying to read this data, however, the larger
> partitions make it difficult to parallelize jobs. For example, out of a
> couple thousand companies, some have <10 MB data while some have >10GB.
> This is the code I'm using in a Zeppelin notebook and it takes a very long
> time to read in (2+ hours on a ~200 GB dataset from S3):
>
> df1 = sqlContext.read.parquet("s3a://[bucket1]/[prefix1]/")
> df2 = sqlContext.read.parquet("s3a://[bucket2]/[prefix2]/")
>
> # generate a bunch of derived columns here for df1
> df1 = df1.withColumn('derivedcol1', df1.source_col)
>
>
> # limit output columns for later union
> df1 = df1.select(
> [limited set of columns]
> )
>
> # generate a bunch of derived columns here for df2
> df2 = df2.withColumn('derivedcol1', df2.source_col)
>
> # limit output columns for later union
> df2 = df2.select(
> [limited set of columns]
> )
>
> print(df1.rdd.getNumPartitions())
> print(df2.rdd.getNumPartitions())
>
> merge_df = df1.unionAll(df2)
> merge_df.repartition(300)
>
> merge_df.registerTempTable("union_table")
> sqlContext.cacheTable("union_table")
> sqlContext.sql("select count(*) from union_table").collect()
>
> Any suggestions on making this faster?
>


Persist RDD doubt

2017-03-23 Thread nayan sharma
Does Spark clears the persisted RDD in case if the task fails ?

Regards,

Nayan

Aggregated column name

2017-03-23 Thread Wen Pei Yu
Hi All
 
I found that some Spark versions (e.g. Spark 1.4) return an upper-case aggregated column name, and some return lower case.
The code below,
df.groupby(col("...")).agg(count("number"))
may return
 
COUNT(number)  -- spark 1.4
count(number)  -- spark 1.6
 
Anyone know if there is configure parameter for this, or which PR change this?
 
Thank you very much.
Yu Wenpei.





Collaborative Filtering - scaling of the regularization parameter

2017-03-23 Thread chris snow
The documentation for collaborative filtering is as follows:

===
Scaling of the regularization parameter

Since v1.1, we scale the regularization parameter lambda in solving
each least squares problem by the number of ratings the user generated
in updating user factors, or the number of ratings the product
received in updating product factors.
===

I find this description confusing, probably because I lack a detailed
understanding of ALS. The wording suggests that the number of ratings
changes ("generated", "received") while solving the least squares problems.

This is how I think I should be interpreting the description:

===
Since v1.1, we scale the regularization parameter lambda when solving
each least squares problem.  When updating the user factors, we scale
the regularization parameter by the total number of ratings from the
user.  Similarly, when updating the product factors, we scale the
regularization parameter by the total number of ratings for the
product.
===

Have I understood this correctly?
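
For what it's worth, the second reading matches the weighted-lambda-regularization
(ALS-WR) formulation; in symbols (my own notation, not taken from the Spark docs),
the objective being minimized is roughly

  \min_{X,Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
    + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \right)

where \Omega is the set of observed ratings, n_u is the number of ratings user u
has generated, and m_i is the number of ratings product i has received.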




Mismatch in data type comparison results in full data in Spark

2017-03-23 Thread santlal56
Hi,

I am using the *where method* of a dataframe to filter data.
I am comparing an Integer field with String-type data, and this comparison returns
the full table data.
I have tested the same scenario with HIVE and MYSQL, but there this comparison does
not return any result.

*Scenario : *

 val sqlDf = df.where("f1 = 'abc'") 
 here f1 : Integer
 
* Input:*
 14
 15
 16
 
* output: *
 14
 15
 16
 
*Logical and Physical Plan : *
 
 == Parsed Logical Plan ==
'Filter ('f1 = abc)
+- Relation[f1#0] csv

== Analyzed Logical Plan ==
f1: int
Filter (cast(f1#0 as double) = cast(abc as double))
+- Relation[f1#0] csv

== Optimized Logical Plan ==
Filter (isnotnull(f1#0) && null)
+- Relation[f1#0] csv

== Physical Plan ==
*Project [f1#0]
+- *Filter isnotnull(f1#0)
   +- *Scan csv [f1#0] Format: CSV, InputPaths:
file:/C:/Users/santlalg/IdeaProjects/SparkTestPoc/Int, PartitionFilters:
[null], PushedFilters: [IsNotNull(f1)], ReadSchema: struct

  
In the *Optimized Logical Plan*, why is *cast(f1#0 as double) = cast(abc as
double)* from the *Analyzed Logical Plan* replaced with /null/?
   
I am using below version of dependency:
Spark-core : 2.0.2
Spark-sql : 2.0.2

In my scenario this comparison should evaluate to false, so the dataframe should
not return any rows.
Can someone help me achieve this?
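
One possible workaround (a sketch only, not a definitive fix for the plan shown
above): make the comparison explicitly string-vs-string so the analyzer does not
insert the numeric casts:

import org.apache.spark.sql.functions.col

// Compare the integer column as a string; for values like 14/15/16 vs 'abc'
// this should return no rows.
val sqlDf = df.where(col("f1").cast("string") === "abc")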

Thanks 
Santlal 
   
 
   



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mismatch-in-data-type-comparision-results-full-data-in-Spark-tp28521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark data frame map problem

2017-03-23 Thread Yan Facai
Could you give more details of your code?



On Wed, Mar 22, 2017 at 2:40 AM, Shashank Mandil 
wrote:

> Hi All,
>
> I have a spark data frame which has 992 rows inside it.
> When I run a map on this data frame I expect that the map should work for
> all the 992 rows.
>
> As a mapper runs on an executor on  a cluster I did a distributed count of
> the number of rows the mapper is being run on.
>
> dataframe.map(r => {
>//distributed count inside here using zookeeper
> })
>
> I have found that this distributed count inside the mapper is not exactly
> 992. I have found this number to vary with different runs.
>
> Does anybody have any idea what might be happening ? By the way, I am
> using spark 1.6.1
>
> Thanks,
> Shashank
>
>