[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167736#comment-16167736 ]

Hosur Narahari commented on SPARK-22021:

If I just apply this function directly, I can't use it in Spark's ML pipeline, and I have to break the pipeline into two sub-pipelines around the point where the function (the logic) is applied. By providing a transformer, we can build one single pipeline end to end. It would also be a one-stop generic solution for deriving any feature.

> Add a feature transformation to accept a function and apply it on all rows of
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by performing a
> mathematical or other operation on columns of a dataframe, such as summing
> a few columns into a new column, or taking the length of a text message
> field. We currently don't have an efficient way to handle such scenarios in
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the
> specified columns to produce an output column of numerical type, the user
> gains the flexibility to derive features with any domain-specific logic.
> Example:
>
> val function = "function(a, b) { return a + b; }"
> val transformer = new GenFuncTransformer()
>   .setInputCols(Array("v1", "v2"))
>   .setOutputCol("result")
>   .setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
>
> v1   v2   result
> 1.0  2.0  3.0
> 3.0  4.0  7.0

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
Hosur Narahari created SPARK-22021:
--

Summary: Add a feature transformation to accept a function and apply it on all rows of dataframe
Key: SPARK-22021
URL: https://issues.apache.org/jira/browse/SPARK-22021
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 2.3.0
Reporter: Hosur Narahari

We often generate derived features in an ML pipeline by performing a mathematical or other operation on columns of a dataframe, such as summing a few columns into a new column, or taking the length of a text message field. We currently don't have an efficient way to handle such scenarios in an ML pipeline.

By providing a transformer which accepts a function and applies it to the specified columns to produce an output column of numerical type, the user gains the flexibility to derive features with any domain-specific logic.

Example:

val function = "function(a, b) { return a + b; }"
val transformer = new GenFuncTransformer()
  .setInputCols(Array("v1", "v2"))
  .setOutputCol("result")
  .setFunction(function)
val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
val result = transformer.transform(df)
result.show

v1   v2   result
1.0  2.0  3.0
3.0  4.0  7.0
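For comparison, the derived-feature computation this proposal covers can today be expressed with a plain UDF outside the ML pipeline, which is exactly the workflow break the issue describes: a `withColumn` call is a DataFrame operation, not a `Transformer`, so it cannot be a `Pipeline` stage. A minimal sketch of that current approach (assuming the `df` with columns "v1" and "v2" from the example above):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Current workaround: apply the derivation as a UDF via withColumn.
// This computes the same "result" column as the proposed transformer,
// but it lives outside any spark.ml Pipeline.
val add = udf((a: Double, b: Double) => a + b)
val result = df.withColumn("result", add(col("v1"), col("v2")))
result.show()
```

The proposed `GenFuncTransformer` would wrap this kind of logic in a `Transformer` so it can be composed with other pipeline stages.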
[jira] [Created] (SPARK-20093) Exception when Joining dataframe with another dataframe generated by applying groupBy transformation on original one
Hosur Narahari created SPARK-20093:
--

Summary: Exception when joining a dataframe with another dataframe generated by applying a groupBy transformation on the original one
Key: SPARK-20093
URL: https://issues.apache.org/jira/browse/SPARK-20093
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.2.0
Reporter: Hosur Narahari

When we generate a dataframe by grouping, and then join the original dataframe with the aggregated one on the aggregate column, we get an AnalysisException. Below is a piece of code and the resulting exception to reproduce the issue.

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object App {

  lazy val spark = SparkSession.builder.appName("Test").master("local").getOrCreate

  def main(args: Array[String]): Unit = {
    test1()
  }

  private def test1(): Unit = {
    val df = spark.createDataFrame(Seq(
      ("M", 172, 60), ("M", 170, 60), ("F", 155, 56), ("M", 160, 55), ("F", 150, 53)
    )).toDF("gender", "height", "weight")
    val groupDF = df.groupBy("gender").agg(min("height").as("height"))
    groupDF.show()
    val out = groupDF.join(df, groupDF("height") <=> df("height"))
      .select(df("gender"), df("height"), df("weight"))
    out.show()
  }
}

When I run the above code, I get the exception below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) height#8 missing from height#19,height#30,gender#29,weight#31,gender#7 in operator !Join Inner, (height#19 <=> height#8);;
!Join Inner, (height#19 <=> height#8)
:- Aggregate [gender#7], [gender#7, min(height#8) AS height#19]
:  +- Project [_1#0 AS gender#7, _2#1 AS height#8, _3#2 AS weight#9]
:     +- LocalRelation [_1#0, _2#1, _3#2]
+- Project [_1#0 AS gender#29, _2#1 AS height#30, _3#2 AS weight#31]
   +- LocalRelation [_1#0, _2#1, _3#2]
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:342)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:90)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:53)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2831)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:843)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:807)
  at App$.test1(App.scala:17)
  at App$.main(App.scala:9)
  at App.main(App.scala)

Could someone please look into this?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
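As context for anyone hitting this: the analyzer error stems from `groupDF` sharing lineage with `df`, so `groupDF("height")` can resolve to the original `height` attribute rather than the aggregated one, making the join condition ambiguous. A common workaround (a sketch, not part of the original report) is to give the aggregate a distinct column name before joining:

```scala
// Hypothetical workaround sketch: rename the aggregated column so the
// two "height" attributes can no longer be confused during analysis.
val groupDF = df.groupBy("gender").agg(min("height").as("minHeight"))
val out = groupDF.join(df, groupDF("minHeight") <=> df("height"))
  .select(df("gender"), df("height"), df("weight"))
out.show()
```

An `alias` on one side of the self-join achieves the same disambiguation; either way, the analyzer no longer sees two candidate attributes with the same name.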