[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Hosur Narahari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167736#comment-16167736
 ] 

Hosur Narahari commented on SPARK-22021:


If I just apply this function directly, I can't use it in Spark's ML pipeline and 
have to break the pipeline into two sub-pipelines, applying the function (the logic) 
in between. By providing a transformer, we can build one single pipeline end to end. 
It would also be a one-stop, generic solution for deriving any feature. A sketch of 
the intended usage follows.
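
For illustration, here is a rough sketch of how the proposed transformer would let 
a derived feature sit inside one pipeline. GenFuncTransformer is the proposed (not 
yet existing) class, and the surrounding stages are assumptions chosen for the example:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Hypothetical: GenFuncTransformer is the proposed transformer, not an existing Spark API.
val sumFeature = new GenFuncTransformer()
  .setInputCols(Array("v1", "v2"))
  .setOutputCol("v1PlusV2")
  .setFunction("function(a,b) { return a+b;}")

// Assemble raw and derived columns into a feature vector for a downstream model.
val assembler = new VectorAssembler()
  .setInputCols(Array("v1", "v2", "v1PlusV2"))
  .setOutputCol("features")

val lr = new LinearRegression()

// One pipeline end to end; without the transformer, the function application
// would have to happen between two separate sub-pipelines.
val pipeline = new Pipeline().setStages(Array(sumFeature, assembler, lr))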

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> Quite often we generate derived features in an ML pipeline by applying some 
> mathematical or other operation to columns of a dataframe, such as summing a 
> few columns into a new column, or computing the length of a text message 
> field. We currently don't have an efficient way to handle such scenarios in 
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the 
> specified columns to generate an output column of numerical type, the user 
> gains the flexibility to derive features using any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1    v2    result
> 1.0   2.0   3.0
> 3.0   4.0   7.0






[jira] [Created] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Hosur Narahari (JIRA)
Hosur Narahari created SPARK-22021:
--

 Summary: Add a feature transformation to accept a function and 
apply it on all rows of dataframe
 Key: SPARK-22021
 URL: https://issues.apache.org/jira/browse/SPARK-22021
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.0
Reporter: Hosur Narahari


Quite often we generate derived features in an ML pipeline by applying some 
mathematical or other operation to columns of a dataframe, such as summing a 
few columns into a new column, or computing the length of a text message field. 
We currently don't have an efficient way to handle such scenarios in an ML 
pipeline.

By providing a transformer which accepts a function and applies it to the 
specified columns to generate an output column of numerical type, the user 
gains the flexibility to derive features using any domain-specific logic.

Example:

val function = "function(a,b) { return a+b;}"
val transformer = new GenFuncTransformer().setInputCols(Array("v1", "v2")).setOutputCol("result").setFunction(function)
val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
val result = transformer.transform(df)
result.show

v1    v2    result
1.0   2.0   3.0
3.0   4.0   7.0
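
To make the proposal concrete, below is a minimal sketch of what such a 
transformer could look like. It is an assumption, not a finished design: instead 
of evaluating a JavaScript string (which would need a script engine), it takes a 
Scala (Double, Double) => Double, and it hard-codes two input columns:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap, StringArrayParam}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical sketch of the proposed transformer: applies a user-supplied
// binary function to two input columns and appends the result as a new column.
class GenFuncTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("genFuncTransformer"))

  val inputCols = new StringArrayParam(this, "inputCols", "input column names")
  val outputCol = new Param[String](this, "outputCol", "output column name")
  val function = new Param[(Double, Double) => Double](this, "function", "row-wise function")

  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)
  def setFunction(value: (Double, Double) => Double): this.type = set(function, value)

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Wrap the user function as a UDF and apply it to the two input columns.
    val f = udf($(function))
    val Array(c1, c2) = $(inputCols)
    dataset.withColumn($(outputCol), f(col(c1), col(c2)))
  }

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), DoubleType, nullable = false))

  override def copy(extra: ParamMap): GenFuncTransformer = defaultCopy(extra)
}

With this sketch, the example above becomes setFunction((a, b) => a + b), and 
because it is a Transformer it can be placed directly among a Pipeline's stages.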







[jira] [Created] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one

2017-03-25 Thread Hosur Narahari (JIRA)
Hosur Narahari created SPARK-20093:
--

 Summary: Exception when Joining dataframe with another dataframe 
generated by applying groupBy transformation on original one
 Key: SPARK-20093
 URL: https://issues.apache.org/jira/browse/SPARK-20093
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.2.0
Reporter: Hosur Narahari


When we generate a dataframe by grouping and then join it back to the original 
dataframe on the aggregated column, we get an AnalysisException. Below I've 
attached a piece of code and the resulting exception to reproduce the issue.

Code:

import org.apache.spark.sql.SparkSession

object App {

  lazy val spark = SparkSession.builder.appName("Test").master("local").getOrCreate

  def main(args: Array[String]): Unit = {
    test1()
  }

  private def test1(): Unit = {
    import org.apache.spark.sql.functions._
    val df = spark.createDataFrame(Seq(
      ("M", 172, 60), ("M", 170, 60), ("F", 155, 56),
      ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")
    val groupDF = df.groupBy("gender").agg(min("height").as("height"))
    groupDF.show()
    val out = groupDF.join(df, groupDF("height") <=> df("height"))
      .select(df("gender"), df("height"), df("weight"))
    out.show()
  }
}

When I ran the above code, I got the exception below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing from height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, (height#19 <=> height#8);;
!Join Inner, (height#19 <=> height#8)
:- Aggregate [gender#7], [gender#7, min(height#8) AS height#19]
:  +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9]
:     +- LocalRelation [_1#0, _2#1, _3#2]
+- Project [_1#0 AS gender#29, _2#1 AS height#30, _3#2 AS weight#31]
   +- LocalRelation [_1#0, _2#1, _3#2]

at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:90)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2831)
at org.apache.spark.sql.Dataset.join(Dataset.scala:843)
at org.apache.spark.sql.Dataset.join(Dataset.scala:807)
at App$.test1(App.scala:17)
at App$.main(App.scala:9)
at App.main(App.scala)

Could someone please look into this? A workaround sketch follows.
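
In the meantime, a workaround sketch that avoids the error for this reproduction 
(an assumption about usage, not a fix for the analyzer): alias both DataFrames and 
reference the join and select columns through the aliases, so the condition is 
resolved by name against the joined plan rather than capturing column references 
from the pre-join DataFrames.

import org.apache.spark.sql.functions.{col, min}

val df = spark.createDataFrame(Seq(
  ("M", 172, 60), ("M", 170, 60), ("F", 155, 56),
  ("M", 160, 55), ("F", 150, 53))).toDF("gender", "height", "weight")
val groupDF = df.groupBy("gender").agg(min("height").as("height"))

// Alias both sides and resolve columns by name through the aliases instead of
// using groupDF("height") / df("height"), which pin attribute ids that the
// self-join deduplication rewrites away.
val out = groupDF.as("g")
  .join(df.as("d"), col("g.height") <=> col("d.height"))
  .select(col("d.gender"), col("d.height"), col("d.weight"))
out.show()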


