[jira] [Assigned] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing
[ https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32508: --- Assignee: dzcxzl > Disallow empty part col values in partition spec before static partition > writing > > > Key: SPARK-32508 > URL: https://issues.apache.org/jira/browse/SPARK-32508 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > When writing with a static partition spec that contains an empty partition value, the error is only reported after all tasks have completed. > We can reject such writes before submitting the tasks. > > {code:java} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for > key d is null or empty; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) > at > org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32508) Disallow empty part col values in partition spec before static partition writing
[ https://issues.apache.org/jira/browse/SPARK-32508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32508. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29316 [https://github.com/apache/spark/pull/29316] > Disallow empty part col values in partition spec before static partition > writing > > > Key: SPARK-32508 > URL: https://issues.apache.org/jira/browse/SPARK-32508 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.1.0 > > > When writing with a static partition spec that contains an empty partition value, the error is only reported after all tasks have completed. > We can reject such writes before submitting the tasks. > > {code:java} > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for > key d is null or empty; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:113) > at > org.apache.spark.sql.hive.HiveExternalCatalog.getPartitionOption(HiveExternalCatalog.scala:1212) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getPartitionOption(ExternalCatalogWithListener.scala:240) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:276) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
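For context, a minimal sketch of the scenario and the kind of up-front check being proposed. The table name, column names, and helper below are hypothetical illustrations (a SparkSession with Hive support is assumed); this is not the actual Spark patch from the pull request above.
{code:scala}
// The empty static partition value is accepted at submission time; the HiveException shown above
// only surfaces after every task has finished and the partition is registered in the metastore.
spark.sql("CREATE TABLE t (a INT) PARTITIONED BY (d STRING) STORED AS PARQUET")
spark.sql("INSERT OVERWRITE TABLE t PARTITION (d = '') SELECT 1")

// The improvement is to reject such a spec before any task is submitted, roughly:
def checkStaticPartitionValues(spec: Map[String, Option[String]]): Unit =
  spec.foreach { case (key, value) =>
    require(value.exists(_.nonEmpty), s"Partition spec is invalid: value for key $key is null or empty")
  }
{code}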
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197406#comment-17197406 ] Terry Kim commented on SPARK-29900: --- [~laurikoobas] Can you share a repro example? > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32189) Development - Setting up PyCharm
[ https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197402#comment-17197402 ] Apache Spark commented on SPARK-32189: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/29781 > Development - Setting up PyCharm > > > Key: SPARK-32189 > URL: https://issues.apache.org/jira/browse/SPARK-32189 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Haejoon Lee >Priority: Major > > PyCharm is probably one of the most widely used IDEs for Python development. > We should document a standard way so users can easily set it up. > Very rough example: > https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark > but it should be in more detail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32189) Development - Setting up PyCharm
[ https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32189: Assignee: Apache Spark (was: Haejoon Lee) > Development - Setting up PyCharm > > > Key: SPARK-32189 > URL: https://issues.apache.org/jira/browse/SPARK-32189 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > PyCharm is probably one of the most widely used IDE for Python development. > We should document a standard way so users can easily set up. > Very rough example: > https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark > but it should be more in details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32189) Development - Setting up PyCharm
[ https://issues.apache.org/jira/browse/SPARK-32189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32189: Assignee: Haejoon Lee (was: Apache Spark) > Development - Setting up PyCharm > > > Key: SPARK-32189 > URL: https://issues.apache.org/jira/browse/SPARK-32189 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Haejoon Lee >Priority: Major > > PyCharm is probably one of the most widely used IDE for Python development. > We should document a standard way so users can easily set up. > Very rough example: > https://hyukjin-spark.readthedocs.io/en/latest/development/developer-tools.html#setup-pycharm-with-pyspark > but it should be more in details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197395#comment-17197395 ] Lauri Koobas commented on SPARK-29900: -- The problem actually arose from using the `SQLContext.tableNames` that does NOT filter out any temporary views even if you pass it a database name. > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
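A quick way to check the behavior being discussed (illustrative Scala sketch; names are made up, and whether the temp view shows up in the last call is exactly the point in question):
{code:scala}
// Create a persistent table and a temp view with the same name, then compare the listings.
spark.sql("CREATE TABLE default.t (i INT) USING parquet")
spark.range(1).createOrReplaceTempView("t")

spark.sql("SHOW TABLES").show()                                 // includes an isTemporary column
println(spark.sqlContext.tableNames().mkString(", "))           // session temp views may appear here
println(spark.sqlContext.tableNames("default").mkString(", "))  // does passing a database filter out the temp view?
{code}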
[jira] [Commented] (SPARK-32186) Development - Debugging
[ https://issues.apache.org/jira/browse/SPARK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197393#comment-17197393 ] Apache Spark commented on SPARK-32186: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/29781 > Development - Debugging > --- > > Key: SPARK-32186 > URL: https://issues.apache.org/jira/browse/SPARK-32186 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > 1. Python Profiler: > https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/profiler.html, > > {code} > >>> sc._conf.set("spark.python.profile", "true") > >>> rdd = sc.parallelize(range(100)).map(str) > >>> rdd.count() > 100 > >>> sc.show_profiles() > {code} > 2. Python Debugger, e.g.) > https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter > \(?\) > 3. Monitoring Python Workers: {{top}}, `{{ls}}`, etc. \(?\) > 4. PyCharm setup \(?\) (at https://issues.apache.org/jira/browse/SPARK-32189) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32186) Development - Debugging
[ https://issues.apache.org/jira/browse/SPARK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197391#comment-17197391 ] Apache Spark commented on SPARK-32186: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/29781 > Development - Debugging > --- > > Key: SPARK-32186 > URL: https://issues.apache.org/jira/browse/SPARK-32186 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > 1. Python Profiler: > https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/profiler.html, > > {code} > >>> sc._conf.set("spark.python.profile", "true") > >>> rdd = sc.parallelize(range(100)).map(str) > >>> rdd.count() > 100 > >>> sc.show_profiles() > {code} > 2. Python Debugger, e.g.) > https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter > \(?\) > 3. Monitoring Python Workers: {{top}}, `{{ls}}`, etc. \(?\) > 4. PyCharm setup \(?\) (at https://issues.apache.org/jira/browse/SPARK-32189) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions
[ https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32903. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29776 [https://github.com/apache/spark/pull/29776] > GeneratePredicate should be able to eliminate common sub-expressions > > > Key: SPARK-32903 > URL: https://issues.apache.org/jira/browse/SPARK-32903 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > Codegen objects such as {{GenerateMutableProjection}} and {{GenerateUnsafeProjection}} can eliminate common sub-expressions, but {{GeneratePredicate}} currently doesn't. > We encountered a customer issue where a Filter pushed down through a Project caused a performance regression compared with the non-pushed-down case: one expression used in the Filter predicates was evaluated many times. Because of the complex schema, the query nodes are not whole-stage codegen'd, so Spark runs {{Filter.doExecute}} and then calls {{GeneratePredicate}}. The common expression was evaluated many times and became the performance bottleneck. > {{GeneratePredicate}} should be able to eliminate common sub-expressions in such cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
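A sketch of the scenario described in the ticket. The UDF and path are hypothetical stand-ins for an expensive expression and a table whose complex schema keeps the plan off the whole-stage-codegen path:
{code:scala}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val expensive = udf((s: String) => s.length)            // stand-in for a costly expression
val df = spark.read.parquet("/tmp/wide_nested_table")   // hypothetical table with a complex schema
val q = df.select(expensive($"payload").as("n"), $"*")
  .filter($"n" > 10 && $"n" < 100)
// After the Filter is pushed through the Project, both predicates reference expensive($"payload").
// Without common sub-expression elimination in GeneratePredicate, that expression is evaluated
// once per predicate per row in Filter.doExecute (the interpreted/non-whole-stage path).
q.explain(true)
{code}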
[jira] [Commented] (SPARK-27589) Spark file source V2
[ https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197377#comment-17197377 ] Gengliang Wang commented on SPARK-27589: [~dongjoon] Thanks for the reminder. I will revisit this part soon and create a proper JIRA ticket for it. > Spark file source V2 > > > Key: SPARK-27589 > URL: https://issues.apache.org/jira/browse/SPARK-27589 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Re-implement file sources with data source V2 API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32906) Struct field names should not change after normalizing floats
[ https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197374#comment-17197374 ] Apache Spark commented on SPARK-32906: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29780 > Struct field names should not change after normalizing floats > - > > Key: SPARK-32906 > URL: https://issues.apache.org/jira/browse/SPARK-32906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This ticket aims at fixing a minor bug when normalizing floats for struct > types; > {code} > scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec > scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k") > scala> val agg = df.distinct() > scala> agg.explain() > == Physical Plan == > *(2) HashAggregate(keys=[k#40], functions=[]) > +- Exchange hashpartitioning(k#40, 200), true, [id=#62] >+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) > null else named_struct(col1, > knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], > functions=[]) > +- *(1) LocalTableScan [k#40] > scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: > HashAggregateExec => a.output.head } > scala> aggOutput.foreach { attr => println(attr.prettyJson) } > ### Final Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "_1", > ^^^ > "type" : "double", > "nullable" : false, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > ### Partial Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "col1", > > "type" : "double", > "nullable" : true, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32906) Struct field names should not change after normalizing floats
[ https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32906: Assignee: Apache Spark > Struct field names should not change after normalizing floats > - > > Key: SPARK-32906 > URL: https://issues.apache.org/jira/browse/SPARK-32906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > This ticket aims at fixing a minor bug when normalizing floats for struct > types; > {code} > scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec > scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k") > scala> val agg = df.distinct() > scala> agg.explain() > == Physical Plan == > *(2) HashAggregate(keys=[k#40], functions=[]) > +- Exchange hashpartitioning(k#40, 200), true, [id=#62] >+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) > null else named_struct(col1, > knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], > functions=[]) > +- *(1) LocalTableScan [k#40] > scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: > HashAggregateExec => a.output.head } > scala> aggOutput.foreach { attr => println(attr.prettyJson) } > ### Final Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "_1", > ^^^ > "type" : "double", > "nullable" : false, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > ### Partial Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "col1", > > "type" : "double", > "nullable" : true, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32906) Struct field names should not change after normalizing floats
[ https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197373#comment-17197373 ] Apache Spark commented on SPARK-32906: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29780 > Struct field names should not change after normalizing floats > - > > Key: SPARK-32906 > URL: https://issues.apache.org/jira/browse/SPARK-32906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This ticket aims at fixing a minor bug when normalizing floats for struct > types; > {code} > scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec > scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k") > scala> val agg = df.distinct() > scala> agg.explain() > == Physical Plan == > *(2) HashAggregate(keys=[k#40], functions=[]) > +- Exchange hashpartitioning(k#40, 200), true, [id=#62] >+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) > null else named_struct(col1, > knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], > functions=[]) > +- *(1) LocalTableScan [k#40] > scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: > HashAggregateExec => a.output.head } > scala> aggOutput.foreach { attr => println(attr.prettyJson) } > ### Final Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "_1", > ^^^ > "type" : "double", > "nullable" : false, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > ### Partial Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "col1", > > "type" : "double", > "nullable" : true, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32906) Struct field names should not change after normalizing floats
[ https://issues.apache.org/jira/browse/SPARK-32906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32906: Assignee: (was: Apache Spark) > Struct field names should not change after normalizing floats > - > > Key: SPARK-32906 > URL: https://issues.apache.org/jira/browse/SPARK-32906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > This ticket aims at fixing a minor bug when normalizing floats for struct > types; > {code} > scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec > scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k") > scala> val agg = df.distinct() > scala> agg.explain() > == Physical Plan == > *(2) HashAggregate(keys=[k#40], functions=[]) > +- Exchange hashpartitioning(k#40, 200), true, [id=#62] >+- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) > null else named_struct(col1, > knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], > functions=[]) > +- *(1) LocalTableScan [k#40] > scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: > HashAggregateExec => a.output.head } > scala> aggOutput.foreach { attr => println(attr.prettyJson) } > ### Final Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "_1", > ^^^ > "type" : "double", > "nullable" : false, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > ### Partial Aggregate ### > [ { > "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", > "num-children" : 0, > "name" : "k", > "dataType" : { > "type" : "struct", > "fields" : [ { > "name" : "col1", > > "type" : "double", > "nullable" : true, > "metadata" : { } > } ] > }, > "nullable" : true, > "metadata" : { }, > "exprId" : { > "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", > "id" : 40, > "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" > }, > "qualifier" : [ ] > } ] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197370#comment-17197370 ] Gengliang Wang commented on SPARK-28396: [~dongjoon] It's very likely that this will be done. But there is no detailed solution yet. Sorry, I should mark it as "later". Once there is a conclusion, or someone in the community comes up with a good proposal, let's reopen this. cc [~maxgekk] > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32906) Struct field names should not change after normalizing floats
Takeshi Yamamuro created SPARK-32906: Summary: Struct field names should not change after normalizing floats Key: SPARK-32906 URL: https://issues.apache.org/jira/browse/SPARK-32906 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2, 3.1.0 Reporter: Takeshi Yamamuro This ticket aims at fixing a minor bug when normalizing floats for struct types; {code} scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k") scala> val agg = df.distinct() scala> agg.explain() == Physical Plan == *(2) HashAggregate(keys=[k#40], functions=[]) +- Exchange hashpartitioning(k#40, 200), true, [id=#62] +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1 AS k#40], functions=[]) +- *(1) LocalTableScan [k#40] scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: HashAggregateExec => a.output.head } scala> aggOutput.foreach { attr => println(attr.prettyJson) } ### Final Aggregate ### [ { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "k", "dataType" : { "type" : "struct", "fields" : [ { "name" : "_1", ^^^ "type" : "double", "nullable" : false, "metadata" : { } } ] }, "nullable" : true, "metadata" : { }, "exprId" : { "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", "id" : 40, "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" }, "qualifier" : [ ] } ] ### Partial Aggregate ### [ { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "k", "dataType" : { "type" : "struct", "fields" : [ { "name" : "col1", "type" : "double", "nullable" : true, "metadata" : { } } ] }, "nullable" : true, "metadata" : { }, "exprId" : { "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId", "id" : 40, "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366" }, "qualifier" : [ ] } ] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32180) Getting Started - Installation
[ https://issues.apache.org/jira/browse/SPARK-32180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197358#comment-17197358 ] Apache Spark commented on SPARK-32180: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29779 > Getting Started - Installation > -- > > Key: SPARK-32180 > URL: https://issues.apache.org/jira/browse/SPARK-32180 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Rohit Mishra >Priority: Minor > Fix For: 3.1.0 > > > Example: > https://koalas.readthedocs.io/en/latest/getting_started/install.html > https://pandas.pydata.org/docs/getting_started/install.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32180) Getting Started - Installation
[ https://issues.apache.org/jira/browse/SPARK-32180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197357#comment-17197357 ] Apache Spark commented on SPARK-32180: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29779 > Getting Started - Installation > -- > > Key: SPARK-32180 > URL: https://issues.apache.org/jira/browse/SPARK-32180 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Rohit Mishra >Priority: Minor > Fix For: 3.1.0 > > > Example: > https://koalas.readthedocs.io/en/latest/getting_started/install.html > https://pandas.pydata.org/docs/getting_started/install.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big
[ https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197341#comment-17197341 ] wuyi commented on SPARK-32898: -- I think the issue (for executorRunTimeMs) is: before a task reaches "taskStartTimeNs = System.nanoTime()", it might already have been killed (e.g., by another successful attempt), so taskStartTimeNs never gets initialized and remains 0. However, executorRunTimeMs is calculated as "System.nanoTime() - taskStartTimeNs" in collectAccumulatorsAndResetStatusOnFailure, which obviously yields a wrongly large result when taskStartTimeNs = 0. I haven't taken a detailed look at submissionTime, but it sounds like a different issue? Though it may come from the same logic hole. I'd like to make a fix for executorRunTimeMs first if [~linhongliu-db] doesn't mind. > totalExecutorRunTimeMs is too big > - > > Key: SPARK-32898 > URL: https://issues.apache.org/jira/browse/SPARK-32898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Linhong Liu >Priority: Major > > This might be because of incorrectly calculating executorRunTimeMs in > Executor.scala. > The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can > be called when taskStartTimeNs is not set yet (it is 0). > As of now in master branch, here is the problematic code: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] > > An exception is thrown before this line. The catch branch still updates > the metric. > However the query shows as SUCCESSful. Maybe this task is speculative. Not > sure. > > submissionTime in LiveExecutionData may also have a similar problem. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
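A minimal sketch of the guard being suggested. Names mirror Executor.scala, but this is illustrative only, not the actual patch:
{code:scala}
import java.util.concurrent.TimeUnit

@volatile var taskStartTimeNs: Long = 0L  // stays 0 if the task is killed before it starts running

// Only report an executor run time once the task has actually reached
// `taskStartTimeNs = System.nanoTime()`; otherwise the metric would be "now - 0" and look huge.
def executorRunTimeMs(nowNs: Long): Long =
  if (taskStartTimeNs > 0) TimeUnit.NANOSECONDS.toMillis(nowNs - taskStartTimeNs) else 0L
{code}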
[jira] [Commented] (SPARK-18409) LSH approxNearestNeighbors should use approxQuantile instead of sort
[ https://issues.apache.org/jira/browse/SPARK-18409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197338#comment-17197338 ] Apache Spark commented on SPARK-18409: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/29778 > LSH approxNearestNeighbors should use approxQuantile instead of sort > > > Key: SPARK-18409 > URL: https://issues.apache.org/jira/browse/SPARK-18409 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Huaxin Gao >Priority: Major > Labels: bulk-closed > > LSHModel.approxNearestNeighbors sorts the full dataset on the hashDistance in > order to find a threshold. It should use approxQuantile instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
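A sketch of the idea in the ticket (the dataset and column names are hypothetical): estimate the k-th smallest hashDistance with approxQuantile instead of sorting the full dataset on it.
{code:scala}
import org.apache.spark.sql.functions.col

val k = 10
val prob = math.min(1.0, k.toDouble / hashDistDF.count())
// approxQuantile gives an approximate threshold with a bounded relative error, avoiding a full sort.
val Array(threshold) = hashDistDF.stat.approxQuantile("hashDistance", Array(prob), 0.001)
val candidates = hashDistDF.filter(col("hashDistance") <= threshold)
{code}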
[jira] [Assigned] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message
[ https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32905: Assignee: (was: Apache Spark) > ApplicationMaster fails to receive UpdateDelegationTokens message > - > > Key: SPARK-32905 > URL: https://issues.apache.org/jira/browse/SPARK-32905 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, > launching executors on 0 of them. > 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error > org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) > does not implement 'receive' > at > org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error > org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) > does not implement 'receive' > at > org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > With a long-running application in kerberized mode, the AMEndpiont handles > the token updating message wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message
[ https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197333#comment-17197333 ] Apache Spark commented on SPARK-32905: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29777 > ApplicationMaster fails to receive UpdateDelegationTokens message > - > > Key: SPARK-32905 > URL: https://issues.apache.org/jira/browse/SPARK-32905 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, > launching executors on 0 of them. > 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error > org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) > does not implement 'receive' > at > org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error > org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) > does not implement 'receive' > at > org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > With a long-running application in kerberized mode, the AMEndpiont handles > the token updating message wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message
[ https://issues.apache.org/jira/browse/SPARK-32905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32905: Assignee: Apache Spark > ApplicationMaster fails to receive UpdateDelegationTokens message > - > > Key: SPARK-32905 > URL: https://issues.apache.org/jira/browse/SPARK-32905 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > {code:java} > 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, > launching executors on 0 of them. > 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error > org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) > does not implement 'receive' > at > org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error > org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) > does not implement 'receive' > at > org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > With a long-running application in kerberized mode, the AMEndpiont handles > the token updating message wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32905) ApplicationMaster fails to receive UpdateDelegationTokens message
Kent Yao created SPARK-32905: Summary: ApplicationMaster fails to receive UpdateDelegationTokens message Key: SPARK-32905 URL: https://issues.apache.org/jira/browse/SPARK-32905 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 3.0.1, 3.1.0 Reporter: Kent Yao {code:java} 20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them. 20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive' at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive' at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} With a long-running application in kerberized mode, the AMEndpoint handles the token-updating message incorrectly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
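The stack trace shows RpcEndpoint's default receive being hit: UpdateDelegationTokens arrives as a one-way send, but the AM endpoint apparently only handles ask-style messages. A rough sketch of the shape of such a handler, using internal Spark RPC types for illustration only (this is not the actual AMEndpoint code, and token propagation is elided):
{code:scala}
import org.apache.spark.internal.Logging
import org.apache.spark.rpc.{RpcEndpoint, RpcEnv}
import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens

class AMEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint with Logging {
  // Messages delivered with `send` are dispatched to `receive`; without an override, the
  // default implementation throws the SparkException seen in the log above.
  override def receive: PartialFunction[Any, Unit] = {
    case UpdateDelegationTokens(tokens) =>
      logInfo(s"Received ${tokens.length} bytes of updated delegation tokens")
      // propagate the tokens to the current UGI / running executors here
  }
}
{code}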
[jira] [Assigned] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption
[ https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-26425: Assignee: Jungtaek Lim (was: Tathagata Das) > Add more constraint checks in file streaming source to avoid checkpoint > corruption > -- > > Key: SPARK-26425 > URL: https://issues.apache.org/jira/browse/SPARK-26425 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Tathagata Das >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.1.0 > > > Two issues observed in production. > - HDFSMetadataLog.getLatest() tries to read older versions when it is not > able to read the latest listed version file. Not sure why this was done but > this should not be done. If the latest listed file is not readable, then > something is horribly wrong and we should fail rather than report an older > version as that can completely corrupt the checkpoint directory. > - FileStreamSource should check whether adding the a new batch to the > FileStreamSourceLog succeeded or not (similar to how StreamExecution checks > for the OffsetSeqLog) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26425) Add more constraint checks in file streaming source to avoid checkpoint corruption
[ https://issues.apache.org/jira/browse/SPARK-26425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-26425. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 25965 [https://github.com/apache/spark/pull/25965] > Add more constraint checks in file streaming source to avoid checkpoint > corruption > -- > > Key: SPARK-26425 > URL: https://issues.apache.org/jira/browse/SPARK-26425 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.1.0 > > > Two issues observed in production. > - HDFSMetadataLog.getLatest() tries to read older versions when it is not > able to read the latest listed version file. Not sure why this was done but > this should not be done. If the latest listed file is not readable, then > something is horribly wrong and we should fail rather than report an older > version, as that can completely corrupt the checkpoint directory. > - FileStreamSource should check whether adding a new batch to the > FileStreamSourceLog succeeded or not (similar to how StreamExecution checks > for the OffsetSeqLog) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
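A sketch of the second check mentioned above, mirroring how StreamExecution treats OffsetSeqLog.add. Variable names are illustrative; this is not the merged patch:
{code:scala}
// HDFSMetadataLog.add returns false if the batch file already exists, e.g. when another query
// instance is writing to the same checkpoint location. Fail fast instead of silently continuing.
if (!metadataLog.add(currentBatchId, newFiles)) {
  throw new IllegalStateException(
    s"Concurrent update to the log. Multiple streaming jobs detected for batch $currentBatchId")
}
{code}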
[jira] [Created] (SPARK-32904) pyspark.mllib.evaluation.MulticlassMetrics needs to swap the results of precision( ) and recall( )
TinaLi created SPARK-32904: -- Summary: pyspark.mllib.evaluation.MulticlassMetrics needs to swap the results of precision( ) and recall( ) Key: SPARK-32904 URL: https://issues.apache.org/jira/browse/SPARK-32904 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 3.0.1 Reporter: TinaLi [https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/MulticlassMetrics.html] *The values returned by the precision() and recall() methods of this API should be swapped.* Below are the example results I got when I ran this API: metrics = MulticlassMetrics(predictionAndLabels) print (metrics.confusionMatrix().toArray()) print ("precision: ",metrics.precision(1)) print ("recall: ",metrics.recall(1)) [[36631. 2845.] [ 3839. 1610.]] precision: 0.3613916947250281 recall: 0.2954670581758121 predictions.select('prediction').agg(\{'prediction':'sum'}).show() |sum(prediction)| 5449.0| As you can see, my model predicted 5449 cases with label=1, and 1610 out of those 5449 cases are true positives, so precision should be 1610/5449=0.2954670581758121, but this API assigns that value to the recall() method; the two appear to be swapped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
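A quick arithmetic cross-check of the numbers in the report (illustrative Scala; whether this is a bug hinges on whether the rows of confusionMatrix() are read as actual labels, as the MulticlassMetrics scaladoc describes, or as predictions, as the report assumes):
{code:scala}
// Confusion matrix values copied from the report above.
val m = Array(Array(36631.0, 2845.0), Array(3839.0, 1610.0))
val tp = m(1)(1)                                    // 1610

// Reading rows as actual labels and columns as predictions:
val precisionRowsActual = tp / (m(0)(1) + m(1)(1))  // 1610 / 4455 ≈ 0.3614
val recallRowsActual    = tp / (m(1)(0) + m(1)(1))  // 1610 / 5449 ≈ 0.2955

// These match the reported precision(1)/recall(1) exactly; the values only look "swapped"
// if the matrix is instead read with rows as predictions.
{code}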
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197250#comment-17197250 ] Terry Kim commented on SPARK-29900: --- When you run `show tables`, you get the `isTemporary` column, so I don't think there is any confusion. For `SQLContext.tableNames`, you can pass database name (e.g., "default"), which will filter out any temporary views. > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30283) V2 Command logical plan should use UnresolvedV2Relation for a table
[ https://issues.apache.org/jira/browse/SPARK-30283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197249#comment-17197249 ] Terry Kim commented on SPARK-30283: --- [~cloud_fan], this work stopped when alter table hit a bump in the road, but should we resume the work for commands like "drop table", etc., whose v2 behavior is different from v1? WDYT? > V2 Command logical plan should use UnresolvedV2Relation for a table > --- > > Key: SPARK-30283 > URL: https://issues.apache.org/jira/browse/SPARK-30283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > > For the following v2 commands, multi-part names are directly passed to the > command without looking up temp views, thus they are always resolved to > tables: > * DROP TABLE > * REFRESH TABLE > * RENAME TABLE > * REPLACE TABLE > They should be updated to have UnresolvedV2Relation such that temp views are > looked up first in Analyzer.ResolveTables. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27589) Spark file source V2
[ https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197228#comment-17197228 ] Dongjoon Hyun commented on SPARK-27589: --- [~Gengliang.Wang]. What is the new JIRA issue for tracking `1. Make the File source V2 writer working` because SPARK-28396 is closed as 'Won't Fix'? > Spark file source V2 > > > Key: SPARK-27589 > URL: https://issues.apache.org/jira/browse/SPARK-27589 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Re-implement file sources with data source V2 API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197225#comment-17197225 ] Dongjoon Hyun commented on SPARK-28396: --- Hi, [~Gengliang.Wang] This is closed as 'Won't Fix'. So, there is no future plan even in Apache Spark 3.1.0? Also, cc [~cloud_fan] and [~smilegator] > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions
[ https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197223#comment-17197223 ] Apache Spark commented on SPARK-32903: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29776 > GeneratePredicate should be able to eliminate common sub-expressions > > > Key: SPARK-32903 > URL: https://issues.apache.org/jira/browse/SPARK-32903 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Both {{GenerateMutableProjection}} and {{GenerateUnsafeProjection}}, such > codegen objects can eliminate common sub-expressions. But > {{GeneratePredicate}} currently doesn't do it. > We encounter a customer issue that a Filter pushed down through a Project > causes performance issue, compared with not pushed down case. The issue is > one expression used in Filter predicates are run many times. Due to the > complex schema, the query nodes are not wholestage codegen, so it runs > {{Filter.doExecute}} and then call {{GeneratePredicate}}. The common > expression was run many time and became performance bottleneck. > {{GeneratePredicate}} should be able to eliminate common sub-expressions for > such case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
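To make the scenario concrete, here is an assumed (not from the ticket) shape of such a query in PySpark: the same sub-expression is referenced twice in one filter predicate, so without common sub-expression elimination the generated predicate evaluates it once per reference.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumn("s", F.col("id").cast("string"))

# sha2(s, 256) is the common sub-expression: the predicate references it twice.
digest = F.sha2(F.col("s"), 256)
filtered = df.filter(digest.startswith("0") | digest.startswith("1"))
filtered.count()
{code}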
[jira] [Assigned] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions
[ https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32903: Assignee: Apache Spark (was: L. C. Hsieh) > GeneratePredicate should be able to eliminate common sub-expressions > > > Key: SPARK-32903 > URL: https://issues.apache.org/jira/browse/SPARK-32903 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Both {{GenerateMutableProjection}} and {{GenerateUnsafeProjection}}, such > codegen objects can eliminate common sub-expressions. But > {{GeneratePredicate}} currently doesn't do it. > We encounter a customer issue that a Filter pushed down through a Project > causes performance issue, compared with not pushed down case. The issue is > one expression used in Filter predicates are run many times. Due to the > complex schema, the query nodes are not wholestage codegen, so it runs > {{Filter.doExecute}} and then call {{GeneratePredicate}}. The common > expression was run many time and became performance bottleneck. > {{GeneratePredicate}} should be able to eliminate common sub-expressions for > such case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions
[ https://issues.apache.org/jira/browse/SPARK-32903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32903: Assignee: L. C. Hsieh (was: Apache Spark) > GeneratePredicate should be able to eliminate common sub-expressions > > > Key: SPARK-32903 > URL: https://issues.apache.org/jira/browse/SPARK-32903 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Both {{GenerateMutableProjection}} and {{GenerateUnsafeProjection}}, such > codegen objects can eliminate common sub-expressions. But > {{GeneratePredicate}} currently doesn't do it. > We encounter a customer issue that a Filter pushed down through a Project > causes performance issue, compared with not pushed down case. The issue is > one expression used in Filter predicates are run many times. Due to the > complex schema, the query nodes are not wholestage codegen, so it runs > {{Filter.doExecute}} and then call {{GeneratePredicate}}. The common > expression was run many time and became performance bottleneck. > {{GeneratePredicate}} should be able to eliminate common sub-expressions for > such case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32903) GeneratePredicate should be able to eliminate common sub-expressions
L. C. Hsieh created SPARK-32903: --- Summary: GeneratePredicate should be able to eliminate common sub-expressions Key: SPARK-32903 URL: https://issues.apache.org/jira/browse/SPARK-32903 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Both {{GenerateMutableProjection}} and {{GenerateUnsafeProjection}}, such codegen objects can eliminate common sub-expressions. But {{GeneratePredicate}} currently doesn't do it. We encounter a customer issue that a Filter pushed down through a Project causes performance issue, compared with not pushed down case. The issue is one expression used in Filter predicates are run many times. Due to the complex schema, the query nodes are not wholestage codegen, so it runs {{Filter.doExecute}} and then call {{GeneratePredicate}}. The common expression was run many time and became performance bottleneck. {{GeneratePredicate}} should be able to eliminate common sub-expressions for such case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32295: Assignee: (was: Apache Spark) > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32295: Assignee: Apache Spark > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32295) Add not null and size > 0 filters before inner explode to benefit from predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-32295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis reopened SPARK-32295: > Add not null and size > 0 filters before inner explode to benefit from > predicate pushdown > - > > Key: SPARK-32295 > URL: https://issues.apache.org/jira/browse/SPARK-32295 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Major > Labels: performance > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24994) Add UnwrapCastInBinaryComparison optimizer to simplify literal types
[ https://issues.apache.org/jira/browse/SPARK-24994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197198#comment-17197198 ] Apache Spark commented on SPARK-24994: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/29775 > Add UnwrapCastInBinaryComparison optimizer to simplify literal types > > > Key: SPARK-24994 > URL: https://issues.apache.org/jira/browse/SPARK-24994 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: liuxian >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > For this statement: select * from table1 where a = 100; > the data type of `a` is `smallint` , because the defaut data type of 100 is > `int` ,so the data type of 'a' is converted to `int`. > In this case, it does not support push down to parquet. > In our business, for our SQL statements, and we generally do not convert 100 > to `smallint`, We hope that it can support push down to parquet for this > situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
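A hedged sketch of the rewrite this optimizer targets; the table name and the exact rewritten form are assumptions for illustration. Per the description, comparing a SMALLINT column with an INT literal up-casts the column, which blocks Parquet filter pushdown; unwrapping the cast narrows the literal instead.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS table1 (a SMALLINT) USING parquet")

# Without the rule the predicate is effectively CAST(a AS INT) = 100, so no
# filter on column a reaches the Parquet reader; with the rule it can become
# a = CAST(100 AS SMALLINT), which is pushable.
spark.sql("SELECT * FROM table1 WHERE a = 100").explain(True)
{code}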
[jira] [Assigned] (SPARK-32890) Pass all `sql/hive` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-32890: Assignee: Yang Jie > Pass all `sql/hive` module UTs in Scala 2.13 > > > Key: SPARK-32890 > URL: https://issues.apache.org/jira/browse/SPARK-32890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > There are only 4 test cases failed in sql hive module with cmd > > {code:java} > mvn clean install -DskipTests -pl sql/hive -am -Pscala-2.13 -Phive > mvn clean test -pl sql/hive -Pscala-2.13 -Phive{code} > > The failed cases as follow: > * HiveSchemaInferenceSuite (1 FAILED) > * HiveSparkSubmitSuite (1 FAILED) > * StatisticsSuite (1 FAILED) > * HiveDDLSuite (1 FAILED) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32890) Pass all `sql/hive` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-32890. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29760 [https://github.com/apache/spark/pull/29760] > Pass all `sql/hive` module UTs in Scala 2.13 > > > Key: SPARK-32890 > URL: https://issues.apache.org/jira/browse/SPARK-32890 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.1.0 > > > There are only 4 test cases failed in sql hive module with cmd > > {code:java} > mvn clean install -DskipTests -pl sql/hive -am -Pscala-2.13 -Phive > mvn clean test -pl sql/hive -Pscala-2.13 -Phive{code} > > The failed cases as follow: > * HiveSchemaInferenceSuite (1 FAILED) > * HiveSparkSubmitSuite (1 FAILED) > * StatisticsSuite (1 FAILED) > * HiveDDLSuite (1 FAILED) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27589) Spark file source V2
[ https://issues.apache.org/jira/browse/SPARK-27589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197143#comment-17197143 ] Thomas Graves commented on SPARK-27589: --- Somewhat related: I was looking through the v2 code for Parquet and I don't see anything for bucketing. Is bucketing supported with the V2 API? > Spark file source V2 > > > Key: SPARK-27589 > URL: https://issues.apache.org/jira/browse/SPARK-27589 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Re-implement file sources with data source V2 API -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32897) SparkSession.builder.getOrCreate should not show deprecation warning of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-32897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-32897: -- Affects Version/s: (was: 2.4.7) > SparkSession.builder.getOrCreate should not show deprecation warning of > SQLContext > -- > > Key: SPARK-32897 > URL: https://issues.apache.org/jira/browse/SPARK-32897 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > In PySpark shell: > {code} > import warnings > from pyspark.sql import SparkSession, SQLContext > warnings.simplefilter('always', DeprecationWarning) > spark.stop() > SparkSession.builder.getOrCreate() > {code} > shows a deprecation warning from {{SQLContext}} > {code} > /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated > in 3.0.0. Use SparkSession.builder.getOrCreate() instead. > DeprecationWarning) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32897) SparkSession.builder.getOrCreate should not show deprecation warning of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-32897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-32897. --- Fix Version/s: 3.1.0 3.0.2 Assignee: Hyukjin Kwon Resolution: Fixed Issue resolved by pull request 29768 https://github.com/apache/spark/pull/29768 > SparkSession.builder.getOrCreate should not show deprecation warning of > SQLContext > -- > > Key: SPARK-32897 > URL: https://issues.apache.org/jira/browse/SPARK-32897 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.7, 3.0.1, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > In PySpark shell: > {code} > import warnings > from pyspark.sql import SparkSession, SQLContext > warnings.simplefilter('always', DeprecationWarning) > spark.stop() > SparkSession.builder.getOrCreate() > {code} > shows a deprecation warning from {{SQLContext}} > {code} > /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated > in 3.0.0. Use SparkSession.builder.getOrCreate() instead. > DeprecationWarning) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32816) Planner error when aggregating multiple distinct DECIMAL columns
[ https://issues.apache.org/jira/browse/SPARK-32816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32816: --- Assignee: Linhong Liu > Planner error when aggregating multiple distinct DECIMAL columns > > > Key: SPARK-32816 > URL: https://issues.apache.org/jira/browse/SPARK-32816 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.1.0 > > > Running different DISTINCT decimal aggregations causes a query planner error: > {code:java} > java.lang.RuntimeException: You hit a query analyzer bug. Please report your > query to Spark user mailing list. > at scala.sys.package$.error(package.scala:30) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:473) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:67) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:97) > at > org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:82) > at > scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162) > at > scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > {code} > example failing query > {code:java} > import org.apache.spark.util.Utils > // Changing decimal(9, 0) to decimal(8, 0) fixes the problem. Root cause > seems to have to do with > // UnscaledValue being used in one of the expressions but not the other. > val df = spark.range(0, 5, 1, 1).selectExpr( > "id", > "cast(id as decimal(9, 0)) as ss_ext_list_price") > val cacheDir = Utils.createTempDir().getCanonicalPath > df.write.parquet(cacheDir) > spark.read.parquet(cacheDir).createOrReplaceTempView("test_table") > spark.sql(""" > select > avg(distinct ss_ext_list_price), sum(distinct ss_ext_list_price) > from test_table""").explain > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32816) Planner error when aggregating multiple distinct DECIMAL columns
[ https://issues.apache.org/jira/browse/SPARK-32816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32816. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29673 [https://github.com/apache/spark/pull/29673] > Planner error when aggregating multiple distinct DECIMAL columns > > > Key: SPARK-32816 > URL: https://issues.apache.org/jira/browse/SPARK-32816 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Linhong Liu >Priority: Major > Fix For: 3.1.0 > > > Running different DISTINCT decimal aggregations causes a query planner error: > {code:java} > java.lang.RuntimeException: You hit a query analyzer bug. Please report your > query to Spark user mailing list. > at scala.sys.package$.error(package.scala:30) > at > org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:473) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:67) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:97) > at > org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:82) > at > scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162) > at > scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > {code} > example failing query > {code:java} > import org.apache.spark.util.Utils > // Changing decimal(9, 0) to decimal(8, 0) fixes the problem. Root cause > seems to have to do with > // UnscaledValue being used in one of the expressions but not the other. > val df = spark.range(0, 5, 1, 1).selectExpr( > "id", > "cast(id as decimal(9, 0)) as ss_ext_list_price") > val cacheDir = Utils.createTempDir().getCanonicalPath > df.write.parquet(cacheDir) > spark.read.parquet(cacheDir).createOrReplaceTempView("test_table") > spark.sql(""" > select > avg(distinct ss_ext_list_price), sum(distinct ss_ext_list_price) > from test_table""").explain > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Punit Shah closed SPARK-32888. -- Resolved by adding documentation > reading a parallized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: {color:#FF}prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x){color} > * {color:#172b4d}create a df like so: {color:#de350b}mydf = > spark.read.csv(prdd, header=True){color}{color} > * {color:#172b4d}{color:#de350b}df.count(){color:#172b4d} will result in a > record count of zero (when it should be 1){color}{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197073#comment-17197073 ] L. C. Hsieh commented on SPARK-32888: - Yes, there is a difference. But it is due to reading from a file versus reading from an RDD/Dataset. When reading a file, we know exactly which line is the first line, so we can remove it. When we read from an RDD/Dataset, we don't know which lines were the first lines of the files being read. So the best we can do is remove the lines that are the same as the first line in the RDD/Dataset. > reading a parallized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: {color:#FF}prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x){color} > * {color:#172b4d}create a df like so: {color:#de350b}mydf = > spark.read.csv(prdd, header=True){color}{color} > * {color:#172b4d}{color:#de350b}df.count(){color:#172b4d} will result in a > record count of zero (when it should be 1){color}{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
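A small self-contained sketch of the difference being discussed, assuming a local test.csv whose header line aaa,bbb is repeated as the first data row (as in the issue description):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the file directly: only the literal first line is treated as the header.
print(spark.read.csv("test.csv", header=True).count())   # 1

# Reading from an RDD of lines: Spark cannot tell which line came first in which
# file, so every line equal to the header is dropped.
prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
print(spark.read.csv(prdd, header=True).count())          # 0
{code}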
[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197069#comment-17197069 ] Punit Shah commented on SPARK-32888: Thank you for your reply [~viirya] However what I've noticed is that NO lines are removed during a straight csv import. That is why my comment was put forth. There is a difference in the results when reading from csv directly and when reading from rdd. Rdds cause removal of lines, while straight csv don't. > reading a parallized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: {color:#FF}prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x){color} > * {color:#172b4d}create a df like so: {color:#de350b}mydf = > spark.read.csv(prdd, header=True){color}{color} > * {color:#172b4d}{color:#de350b}df.count(){color:#172b4d} will result in a > record count of zero (when it should be 1){color}{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32898) totalExecutorRunTimeMs is too big
[ https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-32898: - Description: This might be because of incorrectly calculating executorRunTimeMs in Executor.scala The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can be called when taskStartTimeNs is not set yet (it is 0). As of now in master branch, here is the problematic code: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] There is a throw exception before this line. The catch branch still updates the metric. However the query shows as SUCCESSful. Maybe this task is speculative. Not sure. submissionTime in LiveExecutionData may also have similar problem. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449] was: This might be because of incorrectly calculating executorRunTimeMs in Executor.scala The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can be called when taskStartTimeNs is not set yet (it is 0). As of now in master branch, here is the problematic code: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] There is a throw exception before this line. The catch branch still updates the metric. However the query shows as SUCCESSful in QPL. Maybe this task is speculative. Not sure. submissionTime in LiveExecutionData may also have similar problem. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449] > totalExecutorRunTimeMs is too big > - > > Key: SPARK-32898 > URL: https://issues.apache.org/jira/browse/SPARK-32898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Linhong Liu >Priority: Major > > This might be because of incorrectly calculating executorRunTimeMs in > Executor.scala > The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can > be called when taskStartTimeNs is not set yet (it is 0). > As of now in master branch, here is the problematic code: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] > > There is a throw exception before this line. The catch branch still updates > the metric. > However the query shows as SUCCESSful. Maybe this task is speculative. Not > sure. > > submissionTime in LiveExecutionData may also have similar problem. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
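As a toy illustration (plain Python, not Spark code) of why an unset start time inflates the metric as described above: subtracting a taskStartTimeNs of 0 from the current monotonic clock yields roughly the clock's absolute value rather than the task's duration.

{code:python}
import time

task_start_ns = 0                      # never assigned because the task failed early
task_finish_ns = time.monotonic_ns()   # what the metric code reads at failure time

executor_run_time_ms = (task_finish_ns - task_start_ns) / 1_000_000
print(executor_run_time_ms)            # a huge number, unrelated to the real run time
{code}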
[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197046#comment-17197046 ] L. C. Hsieh commented on SPARK-32888: - Reading CSV files directly is simple: we can just remove the first line. But when we read an RDD[String] or Dataset[String] containing CSV lines, we don't know which lines are the first lines of the files. So what we can do is just remove the lines that are the same as the header. > reading a parallized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: {color:#FF}prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x){color} > * {color:#172b4d}create a df like so: {color:#de350b}mydf = > spark.read.csv(prdd, header=True){color}{color} > * {color:#172b4d}{color:#de350b}df.count(){color:#172b4d} will result in a > record count of zero (when it should be 1){color}{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32850) Simplify the RPC message flow of decommission
[ https://issues.apache.org/jira/browse/SPARK-32850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32850: --- Assignee: wuyi > Simplify the RPC message flow of decommission > --- > > Key: SPARK-32850 > URL: https://issues.apache.org/jira/browse/SPARK-32850 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > The RPC message flow of decommission is kind of messy because multiple use > cases are mixed together. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32850) Simplify the RPC message flow of decommission
[ https://issues.apache.org/jira/browse/SPARK-32850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32850. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29722 [https://github.com/apache/spark/pull/29722] > Simplify the RPC message flow of decommission > --- > > Key: SPARK-32850 > URL: https://issues.apache.org/jira/browse/SPARK-32850 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > The RPC message flow of decommission is kind of messy because multiple use > cases are mixed together. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196985#comment-17196985 ] Punit Shah edited comment on SPARK-32888 at 9/16/20, 2:55 PM: -- Why do we remove lines that are the same as the header? The result of this behaviour differs from reading csv files directly as opposed to from rdds. [~viirya] would appreciate comment thanks... was (Author: bullsoverbears): Why do we remove lines that are the same as the header? The result of this behaviour differs from reading csv files directly as opposed to from rdds. > reading a parallized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: {color:#FF}prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x){color} > * {color:#172b4d}create a df like so: {color:#de350b}mydf = > spark.read.csv(prdd, header=True){color}{color} > * {color:#172b4d}{color:#de350b}df.count(){color:#172b4d} will result in a > record count of zero (when it should be 1){color}{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32706) Poor performance when casting invalid decimal string to decimal type
[ https://issues.apache.org/jira/browse/SPARK-32706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32706. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29731 [https://github.com/apache/spark/pull/29731] > Poor performance when casting invalid decimal string to decimal type > > > Key: SPARK-32706 > URL: https://issues.apache.org/jira/browse/SPARK-32706 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.1.0 > > Attachments: part-0.parquet > > > How to reproduce this issue: > {code:java} > spark.read.parquet("/path/to/part-0.parquet").selectExpr("cast(bd as > decimal(18, 0)) as x").write.mode("overwrite").save("/tmp/spark/decimal") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
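For readers without the attached file, a self-contained variant of the reproduction; it assumes (as the issue title says) that the bd column holds strings that are not valid decimals, and the row count and output path are arbitrary.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Many rows whose string value cannot be parsed as a decimal.
df = spark.createDataFrame([("not a decimal",)] * 100000, ["bd"])
df.selectExpr("cast(bd as decimal(18, 0)) as x") \
  .write.mode("overwrite").save("/tmp/spark/decimal")
{code}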
[jira] [Assigned] (SPARK-32706) Poor performance when casting invalid decimal string to decimal type
[ https://issues.apache.org/jira/browse/SPARK-32706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32706: --- Assignee: Yuming Wang > Poor performance when casting invalid decimal string to decimal type > > > Key: SPARK-32706 > URL: https://issues.apache.org/jira/browse/SPARK-32706 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Attachments: part-0.parquet > > > How to reproduce this issue: > {code:java} > spark.read.parquet("/path/to/part-0.parquet").selectExpr("cast(bd as > decimal(18, 0)) as x").write.mode("overwrite").save("/tmp/spark/decimal") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196985#comment-17196985 ] Punit Shah commented on SPARK-32888: Why do we remove lines that are the same as the header? The result of this behaviour differs from reading csv files directly as opposed to from rdds. > reading a parallized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: {color:#FF}prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x){color} > * {color:#172b4d}create a df like so: {color:#de350b}mydf = > spark.read.csv(prdd, header=True){color}{color} > * {color:#172b4d}{color:#de350b}df.count(){color:#172b4d} will result in a > record count of zero (when it should be 1){color}{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32902) Logging plan changes for AQE
[ https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32902: Assignee: Apache Spark > Logging plan changes for AQE > > > Key: SPARK-32902 > URL: https://issues.apache.org/jira/browse/SPARK-32902 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Major > > Recently, we added code to log plan changes in the preparation phase in > QueryExecution for execution (SPARK-32704). This ticket targets at applying > the same fix for logging plan changes in AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32902) Logging plan changes for AQE
[ https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196977#comment-17196977 ] Apache Spark commented on SPARK-32902: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29774 > Logging plan changes for AQE > > > Key: SPARK-32902 > URL: https://issues.apache.org/jira/browse/SPARK-32902 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > Recently, we added code to log plan changes in the preparation phase in > QueryExecution for execution (SPARK-32704). This ticket targets at applying > the same fix for logging plan changes in AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32902) Logging plan changes for AQE
[ https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32902: Assignee: (was: Apache Spark) > Logging plan changes for AQE > > > Key: SPARK-32902 > URL: https://issues.apache.org/jira/browse/SPARK-32902 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > Recently, we added code to log plan changes in the preparation phase in > QueryExecution for execution (SPARK-32704). This ticket targets at applying > the same fix for logging plan changes in AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32902) Logging plan changes for AQE
[ https://issues.apache.org/jira/browse/SPARK-32902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196978#comment-17196978 ] Apache Spark commented on SPARK-32902: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29774 > Logging plan changes for AQE > > > Key: SPARK-32902 > URL: https://issues.apache.org/jira/browse/SPARK-32902 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Major > > Recently, we added code to log plan changes in the preparation phase in > QueryExecution for execution (SPARK-32704). This ticket targets at applying > the same fix for logging plan changes in AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
[ https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196976#comment-17196976 ] Apache Spark commented on SPARK-32287: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29773 > Flaky Test: ExecutorAllocationManagerSuite.add executors default profile > > > Key: SPARK-32287 > URL: https://issues.apache.org/jira/browse/SPARK-32287 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > This test becomes flaky in Github Actions, see: > https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509 > {code:java} > [info] - add executors default profile *** FAILED *** (33 milliseconds) > [info] 4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > [info] at > org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32287) Flaky Test: ExecutorAllocationManagerSuite.add executors default profile
[ https://issues.apache.org/jira/browse/SPARK-32287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196975#comment-17196975 ] Apache Spark commented on SPARK-32287: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29773 > Flaky Test: ExecutorAllocationManagerSuite.add executors default profile > > > Key: SPARK-32287 > URL: https://issues.apache.org/jira/browse/SPARK-32287 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > This test becomes flaky in Github Actions, see: > https://github.com/apache/spark/pull/29072/checks?check_run_id=861689509 > {code:java} > [info] - add executors default profile *** FAILED *** (33 milliseconds) > [info] 4 did not equal 2 (ExecutorAllocationManagerSuite.scala:132) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > [info] at > org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$7(ExecutorAllocationManagerSuite.scala:132) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) > [info] at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > [info] at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > [info] at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:59) > [info] at > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) > [info] at > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) > [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:59) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229) > [info] ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32902) Logging plan changes for AQE
Takeshi Yamamuro created SPARK-32902: Summary: Logging plan changes for AQE Key: SPARK-32902 URL: https://issues.apache.org/jira/browse/SPARK-32902 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro Recently, we added code to log plan changes in the preparation phase in QueryExecution for execution (SPARK-32704). This ticket targets at applying the same fix for logging plan changes in AQE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32898) totalExecutorRunTimeMs is too big
[ https://issues.apache.org/jira/browse/SPARK-32898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196972#comment-17196972 ] Thomas Graves commented on SPARK-32898: --- [~linhongliu-db] can you please provide more of a description? You say this was too big: did it cause an error for your job, or did you just notice that the time was too big? Do you have a reproducible case? You have some details there about taskStartTimeNs possibly not being initialized; if you can give more details there in general, that would be great, as it's a bit hard to follow your description. If you have spent the time to debug this and have a fix in mind, please feel free to put up a pull request. > totalExecutorRunTimeMs is too big > - > > Key: SPARK-32898 > URL: https://issues.apache.org/jira/browse/SPARK-32898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Linhong Liu >Priority: Major > > This might be because of incorrectly calculating executorRunTimeMs in > Executor.scala > The function collectAccumulatorsAndResetStatusOnFailure(taskStartTimeNs) can > be called when taskStartTimeNs is not set yet (it is 0). > As of now in master branch, here is the problematic code: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L470] > > There is a throw exception before this line. The catch branch still updates > the metric. > However the query shows as SUCCESSful in QPL. Maybe this task is speculative. > Not sure. > > submissionTime in LiveExecutionData may also have similar problem. > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala#L449] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32894) Timestamp cast in exernal orc table
[ https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-32894: -- Summary: Timestamp cast in exernal orc table (was: Timestamp cast in exernal ocr table) > Timestamp cast in exernal orc table > --- > > Key: SPARK-32894 > URL: https://issues.apache.org/jira/browse/SPARK-32894 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.0 > Environment: Spark 3.0.0 > Java 1.8 > Hadoop 3.3.0 > Hive 3.1.2 > Python 3.7 (from pyspark) >Reporter: Grigory Skvortsov >Priority: Major > > I have the external hive table stored as orc. I want to work with timestamp > column in my table using pyspark. > For example, I try this: > spark.sql('select id, time_ from mydb.table1`).show() > > Py4JJavaError: An error occurred while calling o2877.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: > org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148) > at > org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228) > at > org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730) > at > org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023) > at > 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152) > at > org.apache.spark.s
[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-32635: -- Labels: correct (was: ) > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > --- > > Key: SPARK-32635 > URL: https://issues.apache.org/jira/browse/SPARK-32635 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Vinod KC >Priority: Blocker > Labels: correct > > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > eg:lit() function with cache() function. > --- > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Wrong result > {code} > > Output > --- > {code:java} > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1 > ++ > |col2| > ++ > | 1| > ++ > ++{code} > lit() function without cache() function. > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner") > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Correct result > {code} > > Output > {code:java} > -- > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached > ++ > |col2| > ++ > | 2| > ++ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-32635: -- Labels: correctness (was: correct) > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > --- > > Key: SPARK-32635 > URL: https://issues.apache.org/jira/browse/SPARK-32635 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Vinod KC >Priority: Blocker > Labels: correctness > > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > eg:lit() function with cache() function. > --- > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Wrong result > {code} > > Output > --- > {code:java} > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1 > ++ > |col2| > ++ > | 1| > ++ > ++{code} > lit() function without cache() function. > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner") > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Correct result > {code} > > Output > {code:java} > -- > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached > ++ > |col2| > ++ > | 2| > ++ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-32635: -- Priority: Blocker (was: Major) > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > --- > > Key: SPARK-32635 > URL: https://issues.apache.org/jira/browse/SPARK-32635 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Vinod KC >Priority: Blocker > > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > eg:lit() function with cache() function. > --- > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Wrong result > {code} > > Output > --- > {code:java} > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1 > ++ > |col2| > ++ > | 1| > ++ > ++{code} > lit() function without cache() function. > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner") > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Correct result > {code} > > Output > {code:java} > -- > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached > ++ > |col2| > ++ > | 2| > ++ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32185) User Guide - Monitoring
[ https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32185: Assignee: Abhijeet Prasad > User Guide - Monitoring > --- > > Key: SPARK-32185 > URL: https://issues.apache.org/jira/browse/SPARK-32185 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Abhijeet Prasad >Priority: Major > > Monitoring. We should focus on how to monitor PySpark jobs. > - Custom Worker, see also > https://github.com/apache/spark/tree/master/python/test_coverage to enable > test coverage that includes the worker side too. > - Sentry Support (?) > https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark > - Link back https://spark.apache.org/docs/latest/monitoring.html . > - ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
[ https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196931#comment-17196931 ] Apache Spark commented on SPARK-32900: -- User 'tomvanbussel' has created a pull request for this issue: https://github.com/apache/spark/pull/29772 > UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in > the input and radix sorting is used. > > > Key: SPARK-32900 > URL: https://issues.apache.org/jira/browse/SPARK-32900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Tom van Bussel >Priority: Major > > In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has > spilled already it checks whether {{upstream}} is an instance of > {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by > SPARK-14851) and there are NULLs in the input however, upstream will be an > instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should > still be spilled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
[ https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32900: Assignee: (was: Apache Spark) > UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in > the input and radix sorting is used. > > > Key: SPARK-32900 > URL: https://issues.apache.org/jira/browse/SPARK-32900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Tom van Bussel >Priority: Major > > In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has > spilled already it checks whether {{upstream}} is an instance of > {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by > SPARK-14851) and there are NULLs in the input however, upstream will be an > instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should > still be spilled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
[ https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32900: Assignee: Apache Spark > UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in > the input and radix sorting is used. > > > Key: SPARK-32900 > URL: https://issues.apache.org/jira/browse/SPARK-32900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Tom van Bussel >Assignee: Apache Spark >Priority: Major > > In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has > spilled already it checks whether {{upstream}} is an instance of > {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by > SPARK-14851) and there are NULLs in the input however, upstream will be an > instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should > still be spilled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196910#comment-17196910 ] Apache Spark commented on SPARK-32635: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/29771 > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > --- > > Key: SPARK-32635 > URL: https://issues.apache.org/jira/browse/SPARK-32635 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Vinod KC >Priority: Major > > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > eg:lit() function with cache() function. > --- > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Wrong result > {code} > > Output > --- > {code:java} > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1 > ++ > |col2| > ++ > | 1| > ++ > ++{code} > lit() function without cache() function. > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner") > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Correct result > {code} > > Output > {code:java} > -- > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached > ++ > |col2| > ++ > | 2| > ++ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32635: Assignee: Apache Spark > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > --- > > Key: SPARK-32635 > URL: https://issues.apache.org/jira/browse/SPARK-32635 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Vinod KC >Assignee: Apache Spark >Priority: Major > > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > eg:lit() function with cache() function. > --- > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Wrong result > {code} > > Output > --- > {code:java} > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1 > ++ > |col2| > ++ > | 1| > ++ > ++{code} > lit() function without cache() function. > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner") > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Correct result > {code} > > Output > {code:java} > -- > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached > ++ > |col2| > ++ > | 2| > ++ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32635) When pyspark.sql.functions.lit() function is used with dataframe cache, it returns wrong result
[ https://issues.apache.org/jira/browse/SPARK-32635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32635: Assignee: (was: Apache Spark) > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > --- > > Key: SPARK-32635 > URL: https://issues.apache.org/jira/browse/SPARK-32635 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Vinod KC >Priority: Major > > When pyspark.sql.functions.lit() function is used with dataframe cache, it > returns wrong result > eg:lit() function with cache() function. > --- > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner").cache() > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Wrong result > {code} > > Output > --- > {code:java} > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Wrong result, instead of 2, got 1 > ++ > |col2| > ++ > | 1| > ++ > ++{code} > lit() function without cache() function. > {code:java} > from pyspark.sql import Row > from pyspark.sql import functions as F > df_1 = spark.createDataFrame(Row(**x) for x in [{'col1': > 'b'}]).withColumn("col2", F.lit(str(2))) > df_2 = spark.createDataFrame(Row(**x) for x in [{'col1': 'a', 'col3': > 8}]).withColumn("col2", F.lit(str(1))) > df_3 = spark.createDataFrame(Row(**x) for x in [{'col1': 'b', 'col3': > 9}]).withColumn("col2", F.lit(str(2))) > df_23 = df_2.union(df_3) > df_4 = spark.createDataFrame(Row(**x) for x in [{'col3': > 9}]).withColumn("col2", F.lit(str(2))) > sel_col3 = df_23.select('col3', 'col2') > df_4 = df_4.join(sel_col3, on=['col3', 'col2'], how = "inner") > df_23_a = df_23.join(df_1, on=["col1", 'col2'], how="inner") > finaldf = df_23_a.join(df_4, on=['col2', 'col3'], > how='left').filter(F.col('col3') == 9) > finaldf.show() > finaldf.select('col2').show() #Correct result > {code} > > Output > {code:java} > -- > >>> finaldf.show() > ++++ > |col2|col3|col1| > ++++ > | 2| 9| b| > ++++ > >>> finaldf.select('col2').show() #Correct result, when df_23_a is not cached > ++ > |col2| > ++ > | 2| > ++ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32814) Metaclasses are broken for a few classes in Python 3
[ https://issues.apache.org/jira/browse/SPARK-32814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32814: Assignee: Maciej Szymkiewicz > Metaclasses are broken for a few classes in Python 3 > > > Key: SPARK-32814 > URL: https://issues.apache.org/jira/browse/SPARK-32814 > Project: Spark > Issue Type: Bug > Components: ML, PySpark, SQL >Affects Versions: 2.4.0, 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > As of Python 3 {{__metaclass__}} is no longer supported > https://www.python.org/dev/peps/pep-3115/. > However, we have multiple classes which were never migrated to Python 3 > compatible syntax: > - A number of ML {{Params}} with {{__metaclass__ = ABCMeta}} > - Some of the SQL {{types}} with {{__metaclass__ = DataTypeSingleton}} > As a result, some functionality is broken in Python 3. For example > {code:python} > >>> from pyspark.sql.types import BooleanType > >>> > >>> > >>> BooleanType() is BooleanType() > >>> > >>> > False > {code} > or > {code:python} > >>> import inspect > >>> > >>> > >>> from pyspark.ml import Estimator > >>> > >>> > >>> inspect.isabstract(Estimator) > >>> > >>> > False > {code} > where in both cases we expect to see {{True}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32814) Metaclasses are broken for a few classes in Python 3
[ https://issues.apache.org/jira/browse/SPARK-32814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32814. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29664 [https://github.com/apache/spark/pull/29664] > Metaclasses are broken for a few classes in Python 3 > > > Key: SPARK-32814 > URL: https://issues.apache.org/jira/browse/SPARK-32814 > Project: Spark > Issue Type: Bug > Components: ML, PySpark, SQL >Affects Versions: 2.4.0, 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > As of Python 3 {{__metaclass__}} is no longer supported > https://www.python.org/dev/peps/pep-3115/. > However, we have multiple classes which were never migrated to Python 3 > compatible syntax: > - A number of ML {{Params}} with {{__metaclass__ = ABCMeta}} > - Some of the SQL {{types}} with {{__metaclass__ = DataTypeSingleton}} > As a result, some functionality is broken in Python 3. For example > {code:python} > >>> from pyspark.sql.types import BooleanType > >>> > >>> > >>> BooleanType() is BooleanType() > >>> > >>> > False > {code} > or > {code:python} > >>> import inspect > >>> > >>> > >>> from pyspark.ml import Estimator > >>> > >>> > >>> inspect.isabstract(Estimator) > >>> > >>> > False > {code} > where in both cases we expect to see {{True}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
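For readers hitting the same problem in their own code, the migration this issue calls for is replacing the Python 2 style {{__metaclass__}} attribute with the Python 3 {{metaclass=}} class keyword (or the {{abc.ABC}} base class). A minimal sketch of the difference, using made-up class names rather than the actual Spark classes:
{code:python}
from abc import ABCMeta, abstractmethod
import inspect

# Python 2 style: the attribute is silently ignored under Python 3,
# so ABCMeta is never applied and the class is not really abstract.
class OldStyleEstimator(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def fit(self, dataset):
        ...

# Python 3 style: the metaclass keyword is honoured.
class NewStyleEstimator(metaclass=ABCMeta):
    @abstractmethod
    def fit(self, dataset):
        ...

print(inspect.isabstract(OldStyleEstimator))  # False under Python 3
print(inspect.isabstract(NewStyleEstimator))  # True
{code}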
[jira] [Assigned] (SPARK-32835) Add withField to PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-32835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32835: Assignee: Adam Binford > Add withField to PySpark Column class > - > > Key: SPARK-32835 > URL: https://issues.apache.org/jira/browse/SPARK-32835 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > > The withField functionality was added to the Scala API, we simply need a > Python wrapper to call it. > Will need to do the same for dropField once it gets added back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32835) Add withField to PySpark Column class
[ https://issues.apache.org/jira/browse/SPARK-32835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32835. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29699 [https://github.com/apache/spark/pull/29699] > Add withField to PySpark Column class > - > > Key: SPARK-32835 > URL: https://issues.apache.org/jira/browse/SPARK-32835 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.1.0 > > > The withField functionality was added to the Scala API, we simply need a > Python wrapper to call it. > Will need to do the same for dropField once it gets added back. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
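For reference, a minimal usage sketch of the new Python wrapper (assumes Spark 3.1.0 or later and an active SparkSession named {{spark}}; the struct and field names are made up for illustration):
{code:python}
from pyspark.sql import functions as F

# A DataFrame with one struct column "s" holding fields a and b.
df = spark.sql("SELECT named_struct('a', 1, 'b', 2) AS s")

# Replace field b inside the struct without rebuilding the whole struct.
updated = df.withColumn("s", F.col("s").withField("b", F.lit(3)))
updated.select("s.a", "s.b").show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  3|
# +---+---+
{code}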
[jira] [Assigned] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32888: Assignee: L. C. Hsieh > reading a parallelized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x) > * create a df like so: mydf = > spark.read.csv(prdd, header=True) > * mydf.count() will result in a > record count of zero (when it should be 1) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32888. -- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29765 [https://github.com/apache/spark/pull/29765] > reading a parallelized rdd with two identical records results in a zero count > df when read via spark.read.csv > --- > > Key: SPARK-32888 > URL: https://issues.apache.org/jira/browse/SPARK-32888 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Punit Shah >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.1.0, 3.0.2, 2.4.8 > > > * Imagine a two-row csv file like so (where the header and first record are > duplicate rows): > aaa,bbb > aaa,bbb > * The following is pyspark code > * create a parallelized rdd like: prdd = > spark.read.text("test.csv").rdd.flatMap(lambda x : x) > * create a df like so: mydf = > spark.read.csv(prdd, header=True) > * mydf.count() will result in a > record count of zero (when it should be 1) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
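Pulling the repro out of the bullet list above into a single runnable snippet (assumes an active SparkSession named {{spark}} and a local file {{test.csv}} containing the two identical lines):
{code:python}
# test.csv contains two identical lines:
#   aaa,bbb
#   aaa,bbb
prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
mydf = spark.read.csv(prdd, header=True)
print(mydf.count())  # returned 0 on affected versions; 1 is expected
{code}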
[jira] [Created] (SPARK-32901) UnsafeExternalSorter may cause a SparkOutOfMemoryError to be thrown while spilling
Tom van Bussel created SPARK-32901: -- Summary: UnsafeExternalSorter may cause a SparkOutOfMemoryError to be thrown while spilling Key: SPARK-32901 URL: https://issues.apache.org/jira/browse/SPARK-32901 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1, 2.4.7 Reporter: Tom van Bussel Consider the following sequence of events: # {{UnsafeExternalSorter}} runs out of space in its pointer array and attempts to allocate a large array to replace the current one. # {{TaskMemoryManager}} tries to allocate the memory backing the large array using {{MemoryManager}}, but {{MemoryManager}} is only willing to return most but not all of the memory requested. # {{TaskMemoryManager}} asks {{UnsafeExternalSorter}} to spill, which causes {{UnsafeExternalSorter}} to spill the current run to disk, to free its record pages and to reset its {{UnsafeInMemorySorter}}. # {{UnsafeInMemorySorter}} frees its pointer array, and tries to allocate a new small pointer array. # {{TaskMemoryManager}} tries to allocate the memory backing the small array using {{MemoryManager}}, but {{MemoryManager}} is unwilling to give it any memory, as the {{TaskMemoryManager}} is still holding on to the memory it got for the large array. # {{TaskMemoryManager}} again asks {{UnsafeExternalSorter}} to spill, but this time there is nothing to spill. # {{UnsafeInMemorySorter}} receives less memory than it requested, and causes a {{SparkOutOfMemoryError}} to be thrown, which causes the current task to fail. A simple way to fix this is to avoid allocating a new array in {{UnsafeInMemorySorter.reset()}} and to do this on-demand instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
[ https://issues.apache.org/jira/browse/SPARK-32900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom van Bussel updated SPARK-32900: --- Description: In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has spilled already it checks whether {{upstream}} is an instance of {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by SPARK-14851) and there are NULLs in the input however, upstream will be an instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should still be spilled. (was: In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has spilled already it checks whether {{upstream}} is an instance of {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used and there are NULLs in the input however, upstream will be an instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should still be spilled.) > UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in > the input and radix sorting is used. > > > Key: SPARK-32900 > URL: https://issues.apache.org/jira/browse/SPARK-32900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Tom van Bussel >Priority: Major > > In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has > spilled already it checks whether {{upstream}} is an instance of > {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used (added by > SPARK-14851) and there are NULLs in the input however, upstream will be an > instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should > still be spilled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32900) UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used.
Tom van Bussel created SPARK-32900: -- Summary: UnsafeExternalSorter.SpillableIterator cannot spill when there are NULLs in the input and radix sorting is used. Key: SPARK-32900 URL: https://issues.apache.org/jira/browse/SPARK-32900 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1, 2.4.7 Reporter: Tom van Bussel In order to determine whether {{UnsafeExternalSorter.SpillableIterator}} has spilled already it checks whether {{upstream}} is an instance of {{UnsafeInMemorySorter.SortedIterator}}. When radix sorting is used and there are NULLs in the input however, upstream will be an instance of {{UnsafeExternalSorter.ChainedIterator}} instead, but should still be spilled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196812#comment-17196812 ] Lauri Koobas edited comment on SPARK-29900 at 9/16/20, 9:18 AM: Bringing up a related point - {code:java} show tables in {code} always shows also temporary views. Same with {code:java} sqlContext.tableNames("database"){code} This seems counter-intuitive. There should be an additional flag to either include or exclude temporary views from that list. was (Author: laurikoobas): Bringing up a related point - `show tables in ` always shows also temporary views. Same with `sqlContext.tableNames("database")`. This seems counter-intuitive. There should be an additional flag to either include or exclude temporary views from that list. > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29900) make relation lookup behavior consistent within Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196812#comment-17196812 ] Lauri Koobas commented on SPARK-29900: -- Bringing up a related point - `show tables in ` always shows also temporary views. Same with `sqlContext.tableNames("database")`. This seems counter-intuitive. There should be an additional flag to either include or exclude temporary views from that list. > make relation lookup behavior consistent within Spark SQL > - > > Key: SPARK-29900 > URL: https://issues.apache.org/jira/browse/SPARK-29900 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > > Currently, Spark has 2 different relation resolution behaviors: > 1. try to look up temp view first, then try table/persistent view. > 2. try to look up table/persistent view. > The first behavior is used in SELECT, INSERT and a few commands that support > views, like DESC TABLE. > The second behavior is used in most commands. > It's confusing to have inconsistent relation resolution behaviors, and the > benefit is super small. It's only useful when there are temp view and table > with the same name, but users can easily use qualified table name to > disambiguate. > In postgres, the relation resolution behavior is consistent > {code} > cloud0fan=# create schema s1; > CREATE SCHEMA > cloud0fan=# SET search_path TO s1; > SET > cloud0fan=# create table s1.t (i int); > CREATE TABLE > cloud0fan=# insert into s1.t values (1); > INSERT 0 1 > # access table with qualified name > cloud0fan=# select * from s1.t; > i > --- > 1 > (1 row) > # access table with single name > cloud0fan=# select * from t; > i > --- > 1 > (1 rows) > # create a temp view with conflicting name > cloud0fan=# create temp view t as select 2 as i; > CREATE VIEW > # same as spark, temp view has higher proirity during resolution > cloud0fan=# select * from t; > i > --- > 2 > (1 row) > # DROP TABLE also resolves temp view first > cloud0fan=# drop table t; > ERROR: "t" is not a table > # DELETE also resolves temp view first > cloud0fan=# delete from t where i = 0; > ERROR: cannot delete from view "t" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
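A small PySpark snippet that illustrates the behaviour described in the comment above (the database, table, and view names are made up; assumes an active SparkSession named {{spark}}):
{code:python}
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("CREATE TABLE IF NOT EXISTS demo_db.persisted_tbl (i INT) USING parquet")
spark.range(1).createOrReplaceTempView("session_temp_view")

# The session-scoped temp view is listed next to demo_db's own tables,
# distinguished only by the isTemporary column.
spark.sql("SHOW TABLES IN demo_db").show()
{code}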
[jira] [Comment Edited] (SPARK-32894) Timestamp cast in external orc table
[ https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196783#comment-17196783 ] Grigory Skvortsov edited comment on SPARK-32894 at 9/16/20, 8:32 AM: - >From hiveCli using following code: CREATE EXTERNAL TABLE IF NOT EXISTS testtable( HOST String, ID bigint, TYPE int, TIME_ TIMESTAMP, PARTITIONED BY (p1 String, p2 String) CLUSTERED BY (host) INTO 5 BUCKETS STORED AS ORC LOCATION '/user/hive/warehouse/testtable'; was (Author: skvortsovg): >From hiveCli using following code: > Timestamp cast in exernal ocr table > --- > > Key: SPARK-32894 > URL: https://issues.apache.org/jira/browse/SPARK-32894 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.0 > Environment: Spark 3.0.0 > Java 1.8 > Hadoop 3.3.0 > Hive 3.1.2 > Python 3.7 (from pyspark) >Reporter: Grigory Skvortsov >Priority: Major > > I have the external hive table stored as orc. I want to work with timestamp > column in my table using pyspark. > For example, I try this: > spark.sql('select id, time_ from mydb.table1`).show() > > Py4JJavaError: An error occurred while calling o2877.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: > org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148) > at > org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228) > at > org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730) > at > org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950) > at scala.Option.f
[jira] [Commented] (SPARK-32894) Timestamp cast in external orc table
[ https://issues.apache.org/jira/browse/SPARK-32894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196783#comment-17196783 ] Grigory Skvortsov commented on SPARK-32894: --- >From hiveCli using following code: > Timestamp cast in exernal ocr table > --- > > Key: SPARK-32894 > URL: https://issues.apache.org/jira/browse/SPARK-32894 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.0 > Environment: Spark 3.0.0 > Java 1.8 > Hadoop 3.3.0 > Hive 3.1.2 > Python 3.7 (from pyspark) >Reporter: Grigory Skvortsov >Priority: Major > > I have the external hive table stored as orc. I want to work with timestamp > column in my table using pyspark. > For example, I try this: > spark.sql('select id, time_ from mydb.table1`).show() > > Py4JJavaError: An error occurred while calling o2877.showString. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 19, 172.29.14.241, executor 1): java.lang.ClassCastException: > org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) > at > org.apache.spark.sql.catalyst.expressions.MutableLong.update(SpecificInternalRow.scala:148) > at > org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.update(SpecificInternalRow.scala:228) > at > org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53(HiveInspectors.scala:730) > at > org.apache.spark.sql.hive.HiveInspectors.$anonfun$unwrapperFor$53$adapted(HiveInspectors.scala:730) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$.$anonfun$unwrapOrcStructs$4(OrcFileFormat.scala:351) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:96) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2023) > at > 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1972) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1971) > at > scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at > scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1971) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:950) > at scala.Option.foreach(Option.scala:407) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:950) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2203) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2152) > at > org.apache
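One way to narrow this down, not taken from the report itself: a {{UTF8String cannot be cast to java.lang.Long}} error typically means the type Spark reads from the ORC files disagrees with the type declared in the metastore, so comparing the two schemas is a quick check. The table name and path below are those given in the report and in the DDL from the comment above:
{code:python}
# Schema as declared in the metastore-backed table.
spark.table("mydb.table1").printSchema()

# Schema as actually stored in the ORC files under the table location
# from the DDL above.
spark.read.orc("/user/hive/warehouse/testtable").printSchema()
# If time_ is a string in one listing and a timestamp in the other,
# the ClassCastException comes from that mismatch.
{code}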
[jira] [Assigned] (SPARK-32899) Support submit application with user-defined cluster manager
[ https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32899: Assignee: (was: Apache Spark) > Support submit application with user-defined cluster manager > > > Key: SPARK-32899 > URL: https://issues.apache.org/jira/browse/SPARK-32899 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xianyang Liu >Priority: Major > > Users can already define a custom cluster manager via the `ExternalClusterManager` > trait. However, such an application cannot currently be submitted with `SparkSubmit`. > This patch adds support for submitting applications with a user-defined cluster manager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32899) Support submit application with user-defined cluster manager
[ https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196754#comment-17196754 ] Apache Spark commented on SPARK-32899: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/29770 > Support submit application with user-defined cluster manager > > > Key: SPARK-32899 > URL: https://issues.apache.org/jira/browse/SPARK-32899 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xianyang Liu >Priority: Major > > Users can already define a custom cluster manager via the `ExternalClusterManager` > trait. However, such an application cannot currently be submitted with `SparkSubmit`. > This patch adds support for submitting applications with a user-defined cluster manager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32899) Support submit application with user-defined cluster manager
[ https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196753#comment-17196753 ] Apache Spark commented on SPARK-32899: -- User 'ConeyLiu' has created a pull request for this issue: https://github.com/apache/spark/pull/29770 > Support submit application with user-defined cluster manager > > > Key: SPARK-32899 > URL: https://issues.apache.org/jira/browse/SPARK-32899 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xianyang Liu >Priority: Major > > Users can already define a custom cluster manager via the `ExternalClusterManager` > trait. However, such an application cannot currently be submitted with `SparkSubmit`. > This patch adds support for submitting applications with a user-defined cluster manager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32899) Support submit application with user-defined cluster manager
[ https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32899: Assignee: Apache Spark > Support submit application with user-defined cluster manager > > > Key: SPARK-32899 > URL: https://issues.apache.org/jira/browse/SPARK-32899 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Xianyang Liu >Assignee: Apache Spark >Priority: Major > > Users can already define a custom cluster manager via the `ExternalClusterManager` > trait. However, such an application cannot currently be submitted with `SparkSubmit`. > This patch adds support for submitting applications with a user-defined cluster manager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32899) Support submit application with user-defined cluster manager
Xianyang Liu created SPARK-32899: Summary: Support submit application with user-defined cluster manager Key: SPARK-32899 URL: https://issues.apache.org/jira/browse/SPARK-32899 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: Xianyang Liu Users can already define a custom cluster manager via the `ExternalClusterManager` trait. However, such an application cannot currently be submitted with `SparkSubmit`. This patch adds support for submitting applications with a user-defined cluster manager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org