[jira] [Updated] (SPARK-13410) unionAll AnalysisException with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-13410:
--------------------------------
    Fix Version/s: 1.6.1

> unionAll AnalysisException with DataFrames containing UDT columns.
> ------------------------------------------------------------------
>
>                 Key: SPARK-13410
>                 URL: https://issues.apache.org/jira/browse/SPARK-13410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Franklyn Dsouza
>            Assignee: Franklyn Dsouza
>              Labels: patch
>             Fix For: 1.6.1, 2.0.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Unioning two DataFrames that contain UDTs fails with
> {quote}
> AnalysisException: u"unresolved operator 'Union;"
> {quote}
> I tracked this down to this line:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala#L202
> which compares data types between the output attributes of both logical plans.
> However, for UDTs this will be a new instance of UserDefinedType or
> PythonUserDefinedType:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala#L158
>
> So the equality check tests whether the two instances are the same object, and
> since they aren't references to a singleton, the check fails.
> *Note: this works fine if you union a DataFrame with itself.*
> I have a proposed patch for this, which overrides the equality operator on the
> two classes: https://github.com/apache/spark/pull/11279
> Reproduction steps
> {code}
> from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
> from pyspark.sql import types
> # sqlCtx is assumed to be an existing SQLContext (for example, in the pyspark shell).
> schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
> # Note: these need to be two separate DataFrames.
> a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
> b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
> # Fails with the AnalysisException quoted above.
> c = a.unionAll(b)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
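
The failure mode described in the message above can be sketched in plain Python, with no Spark dependency. This is only an analogy: the check linked in the report lives in Spark's Scala sources, and IdentityUDT, ValueUDT, and schemas_match are hypothetical names made up for illustration, not Spark classes or the contents of the proposed patch. The sketch assumes the union check amounts to a pairwise equality comparison of the column data types, as the report describes.

```python
class IdentityUDT(object):
    """Hypothetical stand-in for a UDT class with no equality override.

    Without __eq__, two instances compare equal only if they are the
    same object.
    """
    pass


class ValueUDT(object):
    """Hypothetical stand-in whose equality depends on the class only."""

    def __eq__(self, other):
        return type(self) is type(other)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with the overridden equality.
        return hash(type(self))


def schemas_match(left_types, right_types):
    """Mimic a union check: same column count, pairwise-equal data types."""
    return (len(left_types) == len(right_types) and
            all(l == r for l, r in zip(left_types, right_types)))


# One instance shared by both sides, like unioning a DataFrame with itself.
shared = IdentityUDT()
same_instance = schemas_match([shared], [shared])

# A fresh instance on each side, like two separately created DataFrames.
separate_instances = schemas_match([IdentityUDT()], [IdentityUDT()])

# Fresh instances again, but with equality overridden on the class.
with_override = schemas_match([ValueUDT()], [ValueUDT()])

print(same_instance, separate_instances, with_override)
```

Only the middle case fails, which matches the note that unioning a DataFrame with itself works while two separate DataFrames do not. Overriding equality on the class is the general approach the report attributes to the proposed patch.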
[jira] [Updated] (SPARK-13410) unionAll AnalysisException with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-13410:
------------------------------
    Target Version/s:   (was: 1.6.0)

> unionAll AnalysisException with DataFrames containing UDT columns.
> ------------------------------------------------------------------
>
>                 Key: SPARK-13410
>                 URL: https://issues.apache.org/jira/browse/SPARK-13410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Franklyn Dsouza
>              Labels: patch
>   Original Estimate: 3h
>  Remaining Estimate: 3h
[jira] [Updated] (SPARK-13410) unionAll AnalysisException with DataFrames containing UDT columns.
[ https://issues.apache.org/jira/browse/SPARK-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franklyn Dsouza updated SPARK-13410:
------------------------------------
    Summary: unionAll AnalysisException with DataFrames containing UDT columns.  (was: unionAll throws error with DataFrames containing UDT columns.)

> unionAll AnalysisException with DataFrames containing UDT columns.
> ------------------------------------------------------------------
>
>                 Key: SPARK-13410
>                 URL: https://issues.apache.org/jira/browse/SPARK-13410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Franklyn Dsouza
>              Labels: patch
>   Original Estimate: 3h
>  Remaining Estimate: 3h