[jira] [Comment Edited] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761741#comment-15761741 ] Franklyn Dsouza edited comment on SPARK-18589 at 12/19/16 5:37 PM:
---
The sequence of steps that causes this is:

{code}
join two dataframes A and B
> make a udf that uses one column from A and another from B
> filter on the column produced by the udf
> java.lang.RuntimeException: Invalid PythonUDF (b#1L, c#6L), requires attributes from more than one child.
{code}

Here are some minimal steps to reproduce the issue in PySpark:

{code:python}
from pyspark.sql import types
from pyspark.sql import functions as F

df1 = sqlCtx.createDataFrame([types.Row(a=1, b=2), types.Row(a=1, b=4)])
df2 = sqlCtx.createDataFrame([types.Row(a=1, c=12)])

joined = df1.join(df2, df1['a'] == df2['a'])
extra = joined.withColumn('sum', F.udf(lambda a, b: a + b, types.IntegerType())(joined['b'], joined['c']))
filtered = extra.where(extra['sum'] < F.lit(10)).collect()
{code}

*Calling extra.cache() before the filtering works around the issue*, but obviously isn't a real solution.

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
> Reporter: Nicholas Chammas
> Priority: Minor
>
> Smells like another optimizer bug that's similar to SPARK-17100 and
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires
> attributes from more than one child.
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
>   at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:594)
>   at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
>   at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
>   at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
>   at org.apache.spark.sq