[jira] [Resolved] (SPARK-29761) do not output leading 'interval' in CalendarInterval.toString
[ https://issues.apache.org/jira/browse/SPARK-29761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29761. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26401 [https://github.com/apache/spark/pull/26401] > do not output leading 'interval' in CalendarInterval.toString > - > > Key: SPARK-29761 > URL: https://issues.apache.org/jira/browse/SPARK-29761 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29625) Spark Structured Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968987#comment-16968987 ] Hyukjin Kwon commented on SPARK-29625: -- [~sanysand...@gmail.com] can you at least make a minimized reproducer? > Spark Structure Streaming Kafka Wrong Reset Offset twice > > > Key: SPARK-29625 > URL: https://issues.apache.org/jira/browse/SPARK-29625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Sandish Kumar HN >Priority: Major > > Spark Structure Streaming Kafka Reset Offset twice, once with right offsets > and second time with very old offsets > {code} > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-151 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-118 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-85 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 122677634. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-19 to offset 0. > [2019-10-28 19:27:40,013] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO Fetcher: [Consumer clientId=consumer-1, > groupId=spark-kafka-source-cfacf6b7-b0aa-443f-b01d-b17212087545--1376165614-driver-0] > Resetting offset for partition topic-52 to offset 120504922.* > [2019-10-28 19:27:40,153] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > INFO ContextCleaner: Cleaned accumulator 810 > {code} > which is causing a Data loss issue. > {code} > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - 19/10/28 19:27:40 > ERROR StreamExecution: Query [id = d62ca9e4-6650-454f-8691-a3d576d1e4ba, > runId = 3946389f-222b-495c-9ab2-832c0422cbbb] terminated with error > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > java.lang.IllegalStateException: Partition topic-52's offset was changed from > 122677598 to 120504922, some data may have been missed. > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - Some data may have > been lost because they are not available in Kafka any more; either the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - data was aged out > by Kafka or the topic may have been deleted before all the data in the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - topic was > processed. If you don't want your streaming query to fail on such cases, set > the > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - source option > "failOnDataLoss" to "false". 
> [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource.org$apache$spark$sql$kafka010$KafkaSource$$reportDataLoss(KafkaSource.scala:329) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:283) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > org.apache.spark.sql.kafka010.KafkaSource$$anonfun$8.apply(KafkaSource.scala:281) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247) > [2019-10-28 19:27:40,351] \{bash_operator.py:128} INFO - at > scala.collection.TraversableLike$class.filter(TraversableLike.scala:259) > [2019-10-28 19:27:40,351]
[jira] [Resolved] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27763. -- Resolution: Done I am leaving this resolved for now. Thanks [~maropu] for finalizing this > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29786) Fix MetaException when dropping a partition that does not exist on HDFS.
Deegue created SPARK-29786: -- Summary: Fix MetaException when dropping a partition that does not exist on HDFS. Key: SPARK-29786 URL: https://issues.apache.org/jira/browse/SPARK-29786 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Deegue When we drop a partition that doesn't exist on HDFS, we receive a `MetaException`, even though the partition has actually been dropped. In Hive, no exception is thrown in this case. For example, if we execute alter table test.tmp drop partition(stat_day=20190516); (the partition stat_day=20190516 exists in the Hive metastore but doesn't exist on HDFS), we get: {code:java} Error: Error running query: MetaException(message:File does not exist: /user/hive/warehouse/test.db/tmp/stat_day=20190516 at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2414) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4719) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1237) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getContentSummary(AuthorizationProviderProxyClientProtocol.java:568) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:896) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2278) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2274) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2274) ) (state=,code=0) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
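A minimal reproduction sketch of the scenario above, assuming a Hive-enabled SparkSession named `spark` and the default warehouse location; the database, table, partition and path names are only the hypothetical ones taken from the example in the description:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Create a partitioned table and register one partition in the metastore.
spark.sql("CREATE TABLE test.tmp (id INT) PARTITIONED BY (stat_day STRING)")
spark.sql("ALTER TABLE test.tmp ADD PARTITION (stat_day = '20190516')")

// Delete the partition directory directly on HDFS so the metastore and HDFS disagree.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path("/user/hive/warehouse/test.db/tmp/stat_day=20190516"), true)

// With the reported behavior this raises MetaException, even though the partition
// is actually removed from the metastore; Hive itself does not throw here.
spark.sql("ALTER TABLE test.tmp DROP PARTITION (stat_day = '20190516')")
{code}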
[jira] [Resolved] (SPARK-29709) Structured Streaming: the offset in the checkpoint is suddenly reset to the earliest
[ https://issues.apache.org/jira/browse/SPARK-29709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29709. -- Resolution: Incomplete > Structured Streaming: the offset in the checkpoint is suddenly reset to the > earliest > --- > > Key: SPARK-29709 > URL: https://issues.apache.org/jira/browse/SPARK-29709 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: robert >Priority: Major > > In Structured Streaming, the offset in the checkpoint is suddenly reset to the > earliest for one of the partitions in the Kafka topic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29785) Optimize opening a new session of Spark Thrift Server
Deegue created SPARK-29785: -- Summary: Optimize opening a new session of Spark Thrift Server Key: SPARK-29785 URL: https://issues.apache.org/jira/browse/SPARK-29785 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Deegue When we open a new session of Spark Thrift Server, `use default` is called and a free executor is needed to execute the SQL. This behavior adds ~5 seconds to opening a new session which should only cost ~100ms. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29784) Built-in function trim is not compatible in 3.0 with previous versions
[ https://issues.apache.org/jira/browse/SPARK-29784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968949#comment-16968949 ] pavithra ramachandran commented on SPARK-29784: --- i shall work on this > Built in function trim is not compatible in 3.0 with previous version > - > > Key: SPARK-29784 > URL: https://issues.apache.org/jira/browse/SPARK-29784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT trim('SL', 'SSparkSQLS'); returns empty in Spark 3.0 where as in 2.4 > and 2.3.2 is returning after leading and trailing character removed. > Spark 3.0 – Not correct > jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+ > | trim(SL, SSparkSQLS) | > +---+ > | | > +--- > Spark 2.4 – Correct > jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); > +---+--+ > | trim(SSparkSQLS, SL) | > +---+--+ > | parkSQ | > +---+--+ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29784) Built-in function trim is not compatible in 3.0 with previous versions
ABHISHEK KUMAR GUPTA created SPARK-29784: Summary: Built-in function trim is not compatible in 3.0 with previous versions Key: SPARK-29784 URL: https://issues.apache.org/jira/browse/SPARK-29784 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA SELECT trim('SL', 'SSparkSQLS'); returns an empty string in Spark 3.0, whereas 2.4 and 2.3.2 return the string with the leading and trailing characters removed. Spark 3.0 – Not correct jdbc:hive2://10.18.19.208:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+ | trim(SL, SSparkSQLS) | +---+ | | +--- Spark 2.4 – Correct jdbc:hive2://10.18.18.214:23040/default> SELECT trim('SL', 'SSparkSQLS'); +---+--+ | trim(SSparkSQLS, SL) | +---+--+ | parkSQ | +---+--+ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
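For what it's worth, a small sketch (assuming a SparkSession named `spark`) of the SQL-standard TRIM form, which names the trim characters and the source string explicitly and therefore does not depend on the positional argument order that differs between these releases:

{code:scala}
// Explicit SQL-standard syntax: strip the characters 'S' and 'L' from both ends
// of 'SSparkSQLS'. Under the 2.4 semantics shown above this yields "parkSQ".
spark.sql("SELECT trim(BOTH 'SL' FROM 'SSparkSQLS') AS trimmed").show()
{code}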
[jira] [Comment Edited] (SPARK-29771) Limit executor max failures before failing the application
[ https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968933#comment-16968933 ] Jackey Lee edited comment on SPARK-29771 at 11/7/19 5:12 AM: - This patch mainly targets the scenario where the executor fails to start. Executor runtime failures, which are caused by task errors, are controlled by spark.executor.maxFailures. As another example, adding `--conf spark.executor.extraJavaOptions=-Xmse` to spark-submit also makes the executor retry endlessly. was (Author: jackey lee): This patch is mainly used in the scenario where the executor started failed. The executor runtime failurem, which is caused by task errors is controlled by spark.executor.maxFailures. Another Example, add `--conf spark.executor.extraJavaOptions=-Xmse` after spark-submit, which can also appear executor crazy retry. > Limit executor max failures before failing the application > -- > > Key: SPARK-29771 > URL: https://issues.apache.org/jira/browse/SPARK-29771 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jackey Lee >Priority: Major > > At present, K8S scheduling does not limit the number of failures of the > executors, which may cause executor retried continuously without failing. > A simple example, we add a resource limit on default namespace. After the > driver is started, if the quota is full, the executor will retry the creation > continuously, resulting in a large amount of pod information accumulation. > When many applications encounter such situations, they will affect the K8S > cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29771) Limit executor max failures before failing the application
[ https://issues.apache.org/jira/browse/SPARK-29771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968933#comment-16968933 ] Jackey Lee commented on SPARK-29771: This patch is mainly used in the scenario where the executor started failed. The executor runtime failurem, which is caused by task errors is controlled by spark.executor.maxFailures. Another Example, add `--conf spark.executor.extraJavaOptions=-Xmse` after spark-submit, which can also appear executor crazy retry. > Limit executor max failures before failing the application > -- > > Key: SPARK-29771 > URL: https://issues.apache.org/jira/browse/SPARK-29771 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Jackey Lee >Priority: Major > > At present, K8S scheduling does not limit the number of failures of the > executors, which may cause executor retried continuously without failing. > A simple example, we add a resource limit on default namespace. After the > driver is started, if the quota is full, the executor will retry the creation > continuously, resulting in a large amount of pod information accumulation. > When many applications encounter such situations, they will affect the K8S > cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27986) Support Aggregate Expressions with filter
[ https://issues.apache.org/jira/browse/SPARK-27986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-27986: --- Description: An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following: {noformat} aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]{noformat} [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] I find this feature is also an ANSI SQL standard. {code:java} ::= COUNT[ ] | [ ] | [ ] | [ ] | [ ] | [ ]{code} was: An aggregate expression represents the application of an aggregate function across the rows selected by a query. An aggregate function reduces multiple inputs to a single output value, such as the sum or average of the inputs. The syntax of an aggregate expression is one of the following: {noformat} aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) [ FILTER ( WHERE filter_clause ) ]{noformat} [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] > Support Aggregate Expressions with filter > - > > Key: SPARK-27986 > URL: https://issues.apache.org/jira/browse/SPARK-27986 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > An aggregate expression represents the application of an aggregate function > across the rows selected by a query. An aggregate function reduces multiple > inputs to a single output value, such as the sum or average of the inputs. > The syntax of an aggregate expression is one of the following: > {noformat} > aggregate_name (expression [ , ... ] [ order_by_clause ] ) [ FILTER ( WHERE > filter_clause ) ] > aggregate_name (ALL expression [ , ... ] [ order_by_clause ] ) [ FILTER ( > WHERE filter_clause ) ] > aggregate_name (DISTINCT expression [ , ... ] [ order_by_clause ] ) [ FILTER > ( WHERE filter_clause ) ] > aggregate_name ( * ) [ FILTER ( WHERE filter_clause ) ] > aggregate_name ( [ expression [ , ... ] ] ) WITHIN GROUP ( order_by_clause ) > [ FILTER ( WHERE filter_clause ) ]{noformat} > [https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES] > > I find this feature is also an ANSI SQL standard. > {code:java} > ::= > COUNT[ ] > | [ ] > | [ ] > | [ ] > | [ ] > | [ ]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
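An illustrative sketch of the syntax this ticket asks for, assuming a SparkSession named `spark` and a hypothetical `people` view; whether the query parses at all depends on the FILTER clause support actually landing:

{code:scala}
// Hypothetical data, only for illustration.
spark.range(10).selectExpr("id", "id % 2 AS flag").createOrReplaceTempView("people")

// An aggregate with a FILTER clause: count only rows where flag = 1,
// alongside the unfiltered count, in a single aggregation.
spark.sql(
  """SELECT count(*) AS total,
    |       count(*) FILTER (WHERE flag = 1) AS flagged
    |  FROM people""".stripMargin
).show()
{code}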
[jira] [Assigned] (SPARK-29605) Optimize string to interval casting
[ https://issues.apache.org/jira/browse/SPARK-29605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29605: --- Assignee: Maxim Gekk > Optimize string to interval casting > --- > > Key: SPARK-29605 > URL: https://issues.apache.org/jira/browse/SPARK-29605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Implement new function stringToInterval in IntervalUtils to cast a value of > UTF8String to an instance of CalendarInterval that should be faster than > existing implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29605) Optimize string to interval casting
[ https://issues.apache.org/jira/browse/SPARK-29605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29605. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26256 [https://github.com/apache/spark/pull/26256] > Optimize string to interval casting > --- > > Key: SPARK-29605 > URL: https://issues.apache.org/jira/browse/SPARK-29605 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Implement new function stringToInterval in IntervalUtils to cast a value of > UTF8String to an instance of CalendarInterval that should be faster than > existing implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
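For context, a small sketch (assuming a SparkSession named `spark`) of the cast this ticket optimizes; the exact string formats accepted by the cast vary across versions, so the literal below is only an assumption:

{code:scala}
// Cast a string to a CalendarInterval value; this is the code path that the
// new IntervalUtils.stringToInterval is meant to speed up.
spark.sql("SELECT CAST('1 year 2 months 3 days' AS INTERVAL) AS i").show(truncate = false)
{code}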
[jira] [Commented] (SPARK-29600) array_contains built in function is not backward compatible in 3.0
[ https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968909#comment-16968909 ] ABHISHEK KUMAR GUPTA commented on SPARK-29600: -- [~Udbhav Agrawal]: I went through the migration guide, and it mainly talks about "In Spark 2.4, AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.", which is not the case for the above query. > array_contains built in function is not backward compatible in 3.0 > -- > > Key: SPARK-29600 > URL: https://issues.apache.org/jira/browse/SPARK-29600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws > Exception in 3.0 where as in 2.3.2 is working fine. > Spark 3.0 output: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), > CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS > DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS > DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function > array_contains should have been array followed by a value with same element > type, but it's [array, decimal(1,1)].; line 1 pos 7; > 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), > cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as > decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), > cast(0.033 as decimal(13,3))), 0.2), None)] > Spark 2.3.2 output > 0: jdbc:hive2://10.18.18.214:23040/default> SELECT > array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); > |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), > CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS > DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), > CAST(0.2 AS DECIMAL(13,3)))| > |true| > 1 row selected (0.18 seconds) > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
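A hedged workaround sketch (assuming a SparkSession named `spark`): casting the searched-for value to the array's element type makes both sides of array_contains agree, which avoids relying on the implicit coercion whose behavior differs between 2.x and 3.0 as described above:

{code:scala}
// Cast the search literal to the same decimal type that the array elements
// are coerced to, so array_contains sees matching element types.
spark.sql(
  "SELECT array_contains(array(0, 0.1, 0.2, 0.3, 0.5, 0.02, 0.033), CAST(0.2 AS DECIMAL(13,3)))"
).show()
{code}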
[jira] [Comment Edited] (SPARK-29625) Spark Structured Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968907#comment-16968907 ] Sandish Kumar HN edited comment on SPARK-29625 at 11/7/19 4:04 AM: --- [~hyukjin.kwon] It happened again, As I can't share the code here, I will just post the flow of the code, Look at partition-32 at [2019-11-06 04:33:40,013] And at [2019-11-06 04:33:40,015] Why is Spark is trying to reset twice within the same batch? one with right offset and another with the wrong offset. df-reader=spark.read df.transfer=spark.map (spark transforms functions) df-stream-writer=spark.readStream df-stream-writer=spark.writeStream SourceCode Flow -- analytics-session: type: spark name: etl-flow-correlate options: ( "spark.sql.shuffle.partitions" : "2" ; "spark.sql.windowExec.buffer.spill.threshold" : "98304" ) df-reader: name: ipmac format: jdbc options: ( "url" : "postgres.jdbc.url" ; "driver" : "org.postgresql.Driver" ; "dbtable" : "tablename" ) df-transform: name: ColumnRename transform: alias input: ipmac options: ( "id" : "key" ; "ip" : "src_ip" ) df-stream-reader: name: flow format: kafka options: ( "kafka.bootstrap.servers": "kafka.server" ; "subscribe": "topic name" ; "maxOffsetsPerTrigger": "40" ; "startingOffsets": "latest" ; "failOnDataLoss": "true" ) df-transform: name: decodedFlow transform: EventDecode input: flow options: ( "schema" : "flow" ) df-transform: name: flowWithDate transform: convertTimestampCols input: decodedFlow options: ( "timestampcols" : "start_time,end_time" ; "resolution" : "microsec" ) df-transform: name: correlatedFlow transform: correlateOverWindow input: flowWithDate options: ( "rightDF" : "ipmacColumnRename" ; "timestampcol" : "start_time" ; "timewindow" : "1 minute" ; "join-col" : "key,src_ip" ; "expires-min" : "expires_in_min" ) df-transform: name: nonNullFlow transform: filter input: correlatedFlow options: ( "notnull" : "mac" ) df-transform: name: flowColumnRename transform: alias input: nonNullFlow options: ( "key" : "tid" ) df-transform: name: flowEncoded transform: EventEncode input: flowColumnRename options: ( "key" : "id" ) df-stream-writer: name: flowEncoded format: kafka options: ( "kafka.bootstrap.servers" : "kafka.server" ; "checkpointLocation" : "spark.checkpoint.flow-correlation" ; "trigger-processing" : "10 seconds" ; "query-duration" : "60 minutes" ; "topic" : "topicname" ) Logs- --- 2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-185 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-32 to offset 16395511. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-65 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-98 to offset 0. 
[2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-131 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-164 to offset 1684812. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-197 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1,
[jira] [Reopened] (SPARK-29625) Spark Structured Streaming Kafka Wrong Reset Offset twice
[ https://issues.apache.org/jira/browse/SPARK-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandish Kumar HN reopened SPARK-29625: -- It happened again, As I can't share the code here, I will just post the flow of the code, Look at partition-32 at [2019-11-06 04:33:40,013] And at [2019-11-06 04:33:40,015] Why is Spark is trying to reset twice within the same batch? one with right offset and another with the wrong offset. df-reader=spark.read df.transfer=spark.map (spark transforms functions) df-stream-writer=spark.readStream df-stream-writer=spark.writeStream SourceCode Flow -- analytics-session: type: spark name: etl-flow-correlate options: ( "spark.sql.shuffle.partitions" : "2" ; "spark.sql.windowExec.buffer.spill.threshold" : "98304" ) df-reader: name: ipmac format: jdbc options: ( "url" : "postgres.jdbc.url" ; "driver" : "org.postgresql.Driver" ; "dbtable" : "tablename" ) df-transform: name: ColumnRename transform: alias input: ipmac options: ( "id" : "key" ; "ip" : "src_ip" ) df-stream-reader: name: flow format: kafka options: ( "kafka.bootstrap.servers": "kafka.server" ; "subscribe": "topic name" ; "maxOffsetsPerTrigger": "40" ; "startingOffsets": "latest" ; "failOnDataLoss": "true" ) df-transform: name: decodedFlow transform: EventDecode input: flow options: ( "schema" : "flow" ) df-transform: name: flowWithDate transform: convertTimestampCols input: decodedFlow options: ( "timestampcols" : "start_time,end_time" ; "resolution" : "microsec" ) df-transform: name: correlatedFlow transform: correlateOverWindow input: flowWithDate options: ( "rightDF" : "ipmacColumnRename" ; "timestampcol" : "start_time" ; "timewindow" : "1 minute" ; "join-col" : "key,src_ip" ; "expires-min" : "expires_in_min" ) df-transform: name: nonNullFlow transform: filter input: correlatedFlow options: ( "notnull" : "mac" ) df-transform: name: flowColumnRename transform: alias input: nonNullFlow options: ( "key" : "tid" ) df-transform: name: flowEncoded transform: EventEncode input: flowColumnRename options: ( "key" : "id" ) df-stream-writer: name: flowEncoded format: kafka options: ( "kafka.bootstrap.servers" : "kafka.server" ; "checkpointLocation" : "spark.checkpoint.flow-correlation" ; "trigger-processing" : "10 seconds" ; "query-duration" : "60 minutes" ; "topic" : "topicname" ) Logs 2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-185 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-32 to offset 16395511. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-65 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-98 to offset 0. 
[2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-131 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-164 to offset 1684812. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-197 to offset 0. [2019-11-06 04:33:40,013] \{bash_operator.py:128} INFO - 19/11/06 04:33:40 INFO Fetcher: [Consumer clientId=consumer-1, groupId=spark-kafka-source-57ea8170-0f95-4724-8168-1f9347b74661--1740056283-driver-0] Resetting offset for partition collector.endpoint.flow-11 to offset 10372315. [2019-11-06
[jira] [Comment Edited] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968895#comment-16968895 ] Udit Mehrotra edited comment on SPARK-29767 at 11/7/19 3:41 AM: [~hyukjin.kwon] was finally able to get the core dump of crashing executors. Attached *hs_err_pid13885.log* the error report written along with core dump. In that I notice the following trace: {noformat} RAX= [error occurred during error reporting (printing register info), id 0xb]Stack: [0x7fbe8850f000,0x7fbe8861], sp=0x7fbe8860dad0, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xa9ae92] J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x7fbea94ffabe [0x7fbea94ffa00+0xbe] j org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5 j org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66 j org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13 j scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187 j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210 j 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37 j org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3 j org.apache.spark.executor.Executor$TaskRunner.run()V+383 j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub V [libjvm.so+0x680c5e] V [libjvm.so+0x67e024] V [libjvm.so+0x67e639] V [libjvm.so+0x6c3d41] V [libjvm.so+0xa77c22] V [libjvm.so+0x8c3b12] C [libpthread.so.0+0x7de5] start_thread+0xc5{noformat} Also attached the core dump file *coredump.zip* was (Author: uditme): [~hyukjin.kwon] was finally able to get the core dump of crashing executors. Attached *hs_err_pid13885.log* the error report written along with core dump. In that I notice the following trace: {noformat} RAX= [error occurred during error reporting (printing register info), id 0xb]Stack: [0x7fbe8850f000,0x7fbe8861], sp=0x7fbe8860dad0, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xa9ae92] J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @
[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-29767: -- Attachment: coredump.zip > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: coredump.zip, hs_err_pid13885.log, > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames through both Scala Spark Shell > and PySpark, resulting in executor contains doing a *core dump* and existing > with Exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > . > 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr > at
[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968895#comment-16968895 ] Udit Mehrotra commented on SPARK-29767: --- [~hyukjin.kwon] was finally able to get the core dump of crashing executors. Attached *hs_err_pid13885.log* the error report written along with core dump. In that I notice the following trace: {noformat} RAX= [error occurred during error reporting (printing register info), id 0xb]Stack: [0x7fbe8850f000,0x7fbe8861], sp=0x7fbe8860dad0, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xa9ae92] J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x7fbea94ffabe [0x7fbea94ffa00+0xbe] j org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5 j org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66 j org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25 j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13 j scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78 j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8 j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24 j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26 j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33 j org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187 j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210 j 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37 j org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3 j org.apache.spark.executor.Executor$TaskRunner.run()V+383 j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub V [libjvm.so+0x680c5e] V [libjvm.so+0x67e024] V [libjvm.so+0x67e639] V [libjvm.so+0x6c3d41] V [libjvm.so+0xa77c22] V [libjvm.so+0x8c3b12] C [libpthread.so.0+0x7de5] start_thread+0xc5{noformat} > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: hs_err_pid13885.log, > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames
[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udit Mehrotra updated SPARK-29767: -- Attachment: hs_err_pid13885.log > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: hs_err_pid13885.log, > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames through both Scala Spark Shell > and PySpark, resulting in executor contains doing a *core dump* and existing > with Exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > . > 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > 
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderr > at
[jira] [Comment Edited] (SPARK-28794) Document CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966433#comment-16966433 ] pavithra ramachandran edited comment on SPARK-28794 at 11/7/19 3:08 AM: I shall raise a PR by the weekend. was (Author: pavithraramachandran): i shall raise PR by tomorrow. > Document CREATE TABLE in SQL Reference. > --- > > Key: SPARK-28794 > URL: https://issues.apache.org/jira/browse/SPARK-28794 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29783) Support SQL Standard output style for interval type
Kent Yao created SPARK-29783: Summary: Support SQL Standard output style for interval type Key: SPARK-29783 URL: https://issues.apache.org/jira/browse/SPARK-29783 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Support the SQL standard interval style for output.
||Style||Conf||Year-Month Interval||Day-Time Interval||Mixed Interval||
|{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06|
|{{spark's current}}|ANSI disabled|1 year 2 mons|1 days 2 hours 3 minutes 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
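As a point of reference, a small sketch (assuming a SparkSession named `spark`) that produces interval values of the kinds listed in the table; how they would be rendered under the proposed sql_standard style, and which configuration selects it, is the subject of this ticket and is an assumption until the change lands:

{code:scala}
// A year-month interval and a day-time interval whose textual output style
// (verbose Spark style vs. SQL-standard "1-2" / "3 4:05:06" style) is at issue here.
spark.sql(
  "SELECT INTERVAL 1 YEAR 2 MONTH AS ym, INTERVAL 3 DAY 4 HOUR 5 MINUTE 6 SECOND AS dt"
).show(truncate = false)
{code}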
[jira] [Commented] (SPARK-29778) saveAsTable append mode is not passing writer options
[ https://issues.apache.org/jira/browse/SPARK-29778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968881#comment-16968881 ] yueqi commented on SPARK-29778: --- Does v1 table support append? just checking > saveAsTable append mode is not passing writer options > - > > Key: SPARK-29778 > URL: https://issues.apache.org/jira/browse/SPARK-29778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Critical > > There was an oversight where AppendData is not getting the WriterOptions in > saveAsTable. > [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) Spark broadcast cannot be destroyed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Description: In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, the Broadcast.destroy() method cannot destroy the broadcast data; the driver and executor storage memory shown in the Spark UI keeps increasing. {code:java} val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} was: In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, the Broadcast.destroy() method cannot destroy the broadcast data; the driver and executor storage memory shown in the Spark UI keeps increasing. {code:java} // code placeholder val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} > Spark broadcast cannot be destroyed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, the Broadcast.destroy() > method cannot destroy the broadcast data; the driver and > executor storage memory shown in the Spark UI keeps increasing. > {code:java} > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968853#comment-16968853 ] Sean R. Owen commented on SPARK-29760: -- I mean, what would you add where? you could open a PR to propose it. It may not be worth mentioning separately unless there's a clear place for it. > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
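[Editor's note] In case it helps decide where VALUES would fit in the SQL reference, a small sketch of the standalone, SELECT, and INSERT INTO usages the ticket mentions; the table name "t" is made up.
{code:java}
// Hedged sketch: VALUES as a standalone query, as an inline table in a SELECT,
// and as the source of an INSERT. Table name "t" is a placeholder.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("values-examples").getOrCreate()

spark.sql("VALUES (1, 'one'), (2, 'two'), (3, 'three')").show()
spark.sql("SELECT * FROM VALUES (1, 'one'), (2, 'two') AS v(id, name)").show()
spark.sql("CREATE TABLE IF NOT EXISTS t (id INT, name STRING) USING parquet")
spark.sql("INSERT INTO t VALUES (1, 'one'), (2, 'two')")
{code}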
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968850#comment-16968850 ] jobit mathew commented on SPARK-29760: -- [~srowen],it can be added as a part of Build a SQL reference doc https://issues.apache.org/jira/browse/SPARK-28588 > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: problem versions.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: correct version.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: (was: 33.png) > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: (was: 44.png) > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: correct version.png, problem versions.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Description: In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use Broadcast.destroy() method can not destroy the broadcast data, the driver and executor storage memory in spark ui is continuous increase。 {code:java} //代码占位符 val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} was: In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use Broadcast.destroy() method can not destroy the broadcast data, the driver and executor storage memory in spark ui is continuous increase。 {code:java} //代码占位符 val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") v al rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: 33.png, 44.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") > val rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: 44.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: 33.png, 44.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: 33.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > Attachments: 33.png > > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29782) spark broadcast can not be destoryed in some versions
tzxxh created SPARK-29782: - Summary: spark broadcast can not be destoryed in some versions Key: SPARK-29782 URL: https://issues.apache.org/jira/browse/SPARK-29782 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.3.3 Reporter: tzxxh In Spark versions 2.3.3, 2.4.1, 2.4.2, 2.4.3 and 2.4.4, calling the Broadcast.destroy() method does not destroy the broadcast data; the driver and executor storage memory shown in the Spark UI keeps increasing. {code:java} // code placeholder val batch = Seq(1 to : _*) val strSeq = batch.map(i => s"xxh-$i") val rdd = sc.parallelize(strSeq) rdd.cache() batch.foreach(_ => { val broc = sc.broadcast(strSeq) rdd.map(id => broc.value.contains(id)).collect() broc.destroy() }) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: 1.png > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29782) spark broadcast can not be destoryed in some versions
[ https://issues.apache.org/jira/browse/SPARK-29782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tzxxh updated SPARK-29782: -- Attachment: (was: 1.png) > spark broadcast can not be destoryed in some versions > - > > Key: SPARK-29782 > URL: https://issues.apache.org/jira/browse/SPARK-29782 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.3.3, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: tzxxh >Priority: Major > > In spark version (2.3.3 , 2.4.1 , 2.4.2 , 2.4.3 , 2.4.4) use > Broadcast.destroy() method can not destroy the broadcast data, the driver and > executor storage memory in spark ui is continuous increase。 > {code:java} > //代码占位符 > val batch = Seq(1 to : _*) > val strSeq = batch.map(i => s"xxh-$i") v > al rdd = sc.parallelize(strSeq) > rdd.cache() > batch.foreach(_ => { > val broc = sc.broadcast(strSeq) > rdd.map(id => broc.value.contains(id)).collect() > broc.destroy() > }) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29635) Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink
[ https://issues.apache.org/jira/browse/SPARK-29635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29635. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26292 [https://github.com/apache/spark/pull/26292] > Deduplicate test suites between Kafka micro-batch sink and Kafka continuous > sink > > > Key: SPARK-29635 > URL: https://issues.apache.org/jira/browse/SPARK-29635 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.0.0 > > > There's a comment in KafkaContinuousSinkSuite which is most likely explaining > TODO: > https://github.com/apache/spark/blob/37690dea107623ebca1e47c64db59196ee388f2f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala#L35-L39 > {noformat} > /** > * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 > memory stream. > * Once we have one, this will be changed to a specialization of > KafkaSinkSuite and we won't have > * to duplicate all the code. > */ > {noformat} > Given latest master branch has V2 memory stream now, it is a good time to > deduplicate two suites into one, via having base class and let these suites > override necessary things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29635) Deduplicate test suites between Kafka micro-batch sink and Kafka continuous sink
[ https://issues.apache.org/jira/browse/SPARK-29635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29635: -- Assignee: Jungtaek Lim > Deduplicate test suites between Kafka micro-batch sink and Kafka continuous > sink > > > Key: SPARK-29635 > URL: https://issues.apache.org/jira/browse/SPARK-29635 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > > There's a comment in KafkaContinuousSinkSuite which is most likely explaining > TODO: > https://github.com/apache/spark/blob/37690dea107623ebca1e47c64db59196ee388f2f/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSinkSuite.scala#L35-L39 > {noformat} > /** > * This is a temporary port of KafkaSinkSuite, since we do not yet have a V2 > memory stream. > * Once we have one, this will be changed to a specialization of > KafkaSinkSuite and we won't have > * to duplicate all the code. > */ > {noformat} > Given latest master branch has V2 memory stream now, it is a good time to > deduplicate two suites into one, via having base class and let these suites > override necessary things. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29720) Add linux condition to make ProcfsMetricsGetter more complete
[ https://issues.apache.org/jira/browse/SPARK-29720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29720. -- Resolution: Not A Problem > Add linux condition to make ProcfsMetricsGetter more complete > - > > Key: SPARK-29720 > URL: https://issues.apache.org/jira/browse/SPARK-29720 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: ulysses you >Priority: Minor > > Currently, the decision about whether proc stats can be gathered into executor metrics is: > {code:java} > procDirExists.get && shouldLogStageExecutorProcessTreeMetrics && > shouldLogStageExecutorMetrics > {code} > But /proc is only supported on Linux, so an isLinux condition should be added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
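[Editor's note] For reference, a sketch of the extra check the ticket proposed (the issue was resolved as Not A Problem). The three flags below are local placeholders standing in for the conditions quoted above, and the OS check assumes Apache Commons Lang's SystemUtils rather than whatever Spark itself would use.
{code:java}
// Sketch only. The placeholder flags mirror the conditions quoted in the ticket;
// the added piece is the Linux check, since /proc exists only there.
import java.nio.file.{Files, Paths}
import org.apache.commons.lang3.SystemUtils

val procDirExists: Boolean = Files.isDirectory(Paths.get("/proc"))
val shouldLogStageExecutorProcessTreeMetrics: Boolean = true  // placeholder
val shouldLogStageExecutorMetrics: Boolean = true             // placeholder

val isProcfsAvailable: Boolean =
  SystemUtils.IS_OS_LINUX &&
  procDirExists &&
  shouldLogStageExecutorProcessTreeMetrics &&
  shouldLogStageExecutorMetrics
{code}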
[jira] [Commented] (SPARK-17814) spark submit arguments are truncated in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-17814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968816#comment-16968816 ] Juan Ramos Fuentes commented on SPARK-17814: Looks like I'm about 3 years too late, but I just encountered this same issue. I'm also passing a string of JSON as an arg to spark-submit and seeing the last two curly braces being replaced with empty string. My temporary solution is to avoid having two curly braces next to each other, but I was wondering if you have any thoughts for how to solve this issue [~jerryshao] [~shreyass123] > spark submit arguments are truncated in yarn-cluster mode > - > > Key: SPARK-17814 > URL: https://issues.apache.org/jira/browse/SPARK-17814 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 1.6.1 >Reporter: shreyas subramanya >Priority: Minor > > {noformat} > One of our customers is trying to pass in json through spark-submit as > follows: > spark-submit --verbose --class SimpleClass --master yarn-cluster ./simple.jar > "{\"mode\":\"wf\", \"arrays\":{\"array\":[1]}}" > The application reports the passed arguments as: {"mode":"wf", > "arrays":{"array":[1] > If the same application is submitted in yarn-client mode, as follows: > spark-submit --verbose --class SimpleClass --master yarn-client ./simple.jar > "{\"mode\":\"wf\", \"arrays\":{\"array\":[1]}}" > The application reports the passed args as: {"mode":"wf", > "arrays":{"array":[1]}} > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
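[Editor's note] A minimal sketch of the kind of check the report describes: have the application print the argument it actually received, so the yarn-cluster and yarn-client results can be compared. The class name SimpleClass comes from the report; the body is an assumption.
{code:java}
object SimpleClass {
  def main(args: Array[String]): Unit = {
    // Print exactly what arrived, so truncation of trailing braces is visible.
    println(s"argument count: ${args.length}")
    args.zipWithIndex.foreach { case (a, i) => println(s"arg($i) = $a") }
  }
}
{code}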
[jira] [Updated] (SPARK-29781) Override SBT Jackson-databind dependency like Maven
[ https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29781: -- Summary: Override SBT Jackson-databind dependency like Maven (was: Override SBT Jackson dependency like Maven) > Override SBT Jackson-databind dependency like Maven > --- > > Key: SPARK-29781 > URL: https://issues.apache.org/jira/browse/SPARK-29781 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Dongjoon Hyun >Priority: Major > > This is `branch-2.4` only issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29781) Override SBT Jackson dependency like Maven
Dongjoon Hyun created SPARK-29781: - Summary: Override SBT Jackson dependency like Maven Key: SPARK-29781 URL: https://issues.apache.org/jira/browse/SPARK-29781 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 2.4.4 Reporter: Dongjoon Hyun This is `branch-2.4` only issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29781) Override SBT Jackson dependency like Maven
[ https://issues.apache.org/jira/browse/SPARK-29781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29781: -- Component/s: (was: Spark Core) > Override SBT Jackson dependency like Maven > -- > > Key: SPARK-29781 > URL: https://issues.apache.org/jira/browse/SPARK-29781 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4 >Reporter: Dongjoon Hyun >Priority: Major > > This is `branch-2.4` only issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29780) The UI can access into the ResourceAllocator, whose data structures are being updated from scheduler threads
Alessandro Bellina created SPARK-29780: -- Summary: The UI can access into the ResourceAllocator, whose data structures are being updated from scheduler threads Key: SPARK-29780 URL: https://issues.apache.org/jira/browse/SPARK-29780 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Alessandro Bellina A class extending ResourceAllocator (WorkerResourceInfo), has some potential issues (raised here: https://github.com/apache/spark/pull/26078#discussion_r340342820) The WorkerInfo class is calling into availableAddrs and assignedAddrs but those calls appear to be coming from the UI (looking at the resourcesInfo* functions), e.g. JsonProtocol and MasterPage call this. Since the datastructures in ResourceAllocator are not concurrent, we could end up with bad data or potentially crashes, depending on when the calls are made. Note that there are other calls to the resourceInfo* functions, but those are from the event loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29759) LocalShuffleReaderExec.outputPartitioning should use the corrected attributes
[ https://issues.apache.org/jira/browse/SPARK-29759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-29759. - Fix Version/s: 3.0.0 Resolution: Fixed > LocalShuffleReaderExec.outputPartitioning should use the corrected attributes > - > > Key: SPARK-29759 > URL: https://issues.apache.org/jira/browse/SPARK-29759 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968760#comment-16968760 ] shahid commented on SPARK-29765: Event drop can also happen that if we don't provide sufficient memory. That case increasing the configs won't help I think. If the event drop happens the entire UI behaves weirdly. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) >
[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968757#comment-16968757 ] Udit Mehrotra commented on SPARK-29767: --- [~hyukjin.kwon] As I have mentioned in the description and you an see from the *stdout* logs it fails to write the *core dump*. Any idea how I can get around it ? > Core dump happening on executors while doing simple union of Data Frames > > > Key: SPARK-29767 > URL: https://issues.apache.org/jira/browse/SPARK-29767 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.4 > Environment: AWS EMR 5.27.0, Spark 2.4.4 >Reporter: Udit Mehrotra >Priority: Major > Attachments: > part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet > > > Running a union operation on two DataFrames through both Scala Spark Shell > and PySpark, resulting in executor contains doing a *core dump* and existing > with Exit code 134. > The trace from the *Driver*: > {noformat} > Container exited with a non-zero exit code 134 > . > 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; > aborting job > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 > (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure > (executor 11 exited caused by one of the running tasks) Reason: Container > from a bad node: container_1572981097605_0021_01_77 on host: > ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from > container-launch. > Container id: container_1572981097605_0021_01_77 > Exit code: 134 > Exception message: /bin/bash: line 1: 12611 Aborted > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout > 2> > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stderrStack > trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted > > LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" > 
/usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' > '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' > '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' > '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' > -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/tmp > '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' > -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77 > org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 > --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id > application_1572981097605_0021 --user-class-path > file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_77/__app__.jar > > > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_77/stdout >
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code} df <- createDataFrame(data.frame(x=1)) f1 <- function(x) x + 1 f2 <- function(x) f1(x) + 2 dapplyCollect(df, function(x) { f1(x); f2(x) }) {code} We get following error message: {code} org.apache.spark.SparkException: R computation failed with Error in f1(x) : could not find function "f1" Calls: compute -> computeFunc -> f2 {code} Compare that to this code block with succeeds: {code} dapplyCollect(df, function(x) { f2(x) }) {code} was: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} > SparkR::cleanClosure aggressively removes a function required by user function > -- > > Key: SPARK-29777 > URL: https://issues.apache.org/jira/browse/SPARK-29777 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.4 >Reporter: Hossein Falaki >Priority: Major > > Following code block reproduces the issue: > {code} > df <- createDataFrame(data.frame(x=1)) > f1 <- function(x) x + 1 > f2 <- function(x) f1(x) + 2 > dapplyCollect(df, function(x) { f1(x); f2(x) }) > {code} > We get following error message: > {code} > org.apache.spark.SparkException: R computation failed with > Error in f1(x) : could not find function "f1" > Calls: compute -> computeFunc -> f2 > {code} > Compare that to this code block with succeeds: > {code} > dapplyCollect(df, function(x) { f2(x) }) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968753#comment-16968753 ] Viacheslav Tradunsky commented on SPARK-29765: -- What's in particularly interesting, that the error doesn't reproduce when this stage is in completed list. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code}
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968751#comment-16968751 ] Viacheslav Tradunsky commented on SPARK-29765: -- Increased the capacity to 2 as pointed out in docs: [https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html#scheduling] But the error still happens. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at >
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968744#comment-16968744 ] Bryan Cutler commented on SPARK-28502: -- Ahh, so Arrow 0.15.0+ had a change in the IPC format that requires pyspark to set an env var. See [https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x,] that should fix the problem with the Spark preview and once SPARK-29376 is merged in 3.0, you won't need to do this. {quote} I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf? {quote} I don't believe there is a way to add the key/window in the DataFrame automatically, you will have to manually add it in the udf. > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968739#comment-16968739 ] Viacheslav Tradunsky commented on SPARK-29765: -- I think it is capacity after code review ;) spark.scheduler.listenerbus.eventqueue.capacity Thanks a lot! > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at
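[Editor's note] A hedged sketch of raising the listener-bus queue capacity mentioned in the comment above. The value 20000 is an arbitrary example (the default is 10000), and, as noted elsewhere in the comments, a larger queue only helps when events are dropped because the queue fills up, not when memory itself is the bottleneck.
{code:java}
// Sketch: set the listener-bus event queue capacity when building the session.
// 20000 is only an example value; tune it to the workload.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("larger-listener-queue")
  .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
  .getOrCreate()
{code}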
[jira] [Updated] (SPARK-29779) Compact old event log files and clean up
[ https://issues.apache.org/jira/browse/SPARK-29779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29779: - Parent: SPARK-28594 Issue Type: Sub-task (was: Task) > Compact old event log files and clean up > > > Key: SPARK-29779 > URL: https://issues.apache.org/jira/browse/SPARK-29779 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue is to track the effort on compacting old event logs (and cleaning > up after compaction) without breaking guaranteeing of compatibility. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968737#comment-16968737 ] shahid commented on SPARK-29765: Thanks. Yes, you can try increasing `spark.scheduler.listenerbus.eventqueue.size`. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at 
org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code} -- This message was sent by Atlassian Jira
[jira] [Created] (SPARK-29779) Compact old event log files and clean up
Jungtaek Lim created SPARK-29779: Summary: Compact old event log files and clean up Key: SPARK-29779 URL: https://issues.apache.org/jira/browse/SPARK-29779 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim This issue is to track the effort on compacting old event logs (and cleaning up after compaction) without breaking the compatibility guarantee. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block that succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} was: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} > SparkR::cleanClosure aggressively removes a function required by user function > -- > > Key: SPARK-29777 > URL: https://issues.apache.org/jira/browse/SPARK-29777 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.4 >Reporter: Hossein Falaki >Priority: Major > > Following code block reproduces the issue: > {code:java} > library(SparkR) > sparkR.session() > spark_df <- createDataFrame(na.omit(airquality)) > cody_local2 <- function(param2) { > 10 + param2 > } > cody_local1 <- function(param1) { > cody_local2(param1) > } > result <- cody_local2(5) > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > cody_local1(5) > }) > print(result) > {code} > We get following error message: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 > (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R > computation failed with > Error in cody_local2(param1) : could not find function "cody_local2" > Calls: compute -> computeFunc -> cody_local1 > {code} > Compare that to this code block that succeeds: > {code:java} > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > #cody_local1(5) > }) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
[ https://issues.apache.org/jira/browse/SPARK-29777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki updated SPARK-29777: --- Description: Following code block reproduces the issue: {code:java} library(SparkR) sparkR.session() spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code:java} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} was: Following code block reproduces the issue: {code} library(SparkR) spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} > SparkR::cleanClosure aggressively removes a function required by user function > -- > > Key: SPARK-29777 > URL: https://issues.apache.org/jira/browse/SPARK-29777 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.4 >Reporter: Hossein Falaki >Priority: Major > > Following code block reproduces the issue: > {code:java} > library(SparkR) > sparkR.session() > spark_df <- createDataFrame(na.omit(airquality)) > cody_local2 <- function(param2) { > 10 + param2 > } > cody_local1 <- function(param1) { > cody_local2(param1) > } > result <- cody_local2(5) > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > cody_local1(5) > }) > print(result) > {code} > We get following error message: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 > (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R > computation failed with > Error in cody_local2(param1) : could not find function "cody_local2" > Calls: compute -> computeFunc -> cody_local1 > {code} > Compare that to this code block with succeeds: > {code:java} > calc_df <- dapplyCollect(spark_df, function(x) { > cody_local2(20) > #cody_local1(5) > }) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968720#comment-16968720 ] Viacheslav Tradunsky commented on SPARK-29765: -- Exactly as you said: {code:java} ERROR AsyncEventQueue: Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler{code} Will increase the capacity and go back with results. Thanks! > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Comment Edited] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968703#comment-16968703 ] Nasir Ali edited comment on SPARK-28502 at 11/6/19 9:17 PM: [~bryanc] If I perform any agg (e.g., _df.groupBy("id",F.window("ts", "15 days")).agg(\{"id":"avg"}).show()_ ) on grouped data, pyspark returns me the key (e.g., id, window) with the avg for each group. However, in the above example, when udf returns the struct, it does not automatically return the key. I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf? was (Author: nasirali): [~bryanc] If I perform any agg (e.g., avg) on grouped data, pyspark returns me the key (e.g., window etc.) with the avg for each row. However, in the above example, when udf returns the struct, it does not automatically return the key. I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf with key? > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
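Since the answer to the question above is not captured in this archive, here is only a hedged sketch of the manual approach the commenter describes: flatten the window struct into plain timestamp columns before grouping, then have the GROUPED_MAP UDF echo those key columns back alongside its result. It assumes the `df` defined in the reproducer above, Spark 2.4.x with PyArrow installed, and made-up column names (`win_start`, `win_end`, `id_avg`); it sidesteps the struct-conversion error by never passing the window struct itself into the UDF, and it groups by the window alone to keep the sketch short.

{code:python}
# Hedged sketch, not an official workaround: carry the grouping window through
# a GROUPED_MAP pandas UDF by flattening it to ordinary timestamp columns.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, DoubleType, TimestampType

out_schema = StructType([
    StructField("win_start", TimestampType()),
    StructField("win_end", TimestampType()),
    StructField("id_avg", DoubleType()),
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def avg_per_window(pdf):
    # pdf holds one window's rows; the flattened key columns are echoed back
    # so the result keeps the window bounds next to the computed average.
    return pd.DataFrame({
        "win_start": [pdf["win_start"].iloc[0]],
        "win_end": [pdf["win_end"].iloc[0]],
        "id_avg": [pdf["id"].mean()],
    })

flat = (df
        .withColumn("win", F.window("ts", "15 days"))
        .withColumn("win_start", F.col("win.start"))
        .withColumn("win_end", F.col("win.end"))
        .drop("win"))

result = flat.groupby("win_start", "win_end").apply(avg_per_window)
result.show(truncate=False)
{code}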
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968707#comment-16968707 ] shahid commented on SPARK-29765: Yes, then it seems some event drop has happened. Could you check the logs related to the event drop? To prevent this, I think you can increase the Event queue capacity which is by default 1000 I think > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at >
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968704#comment-16968704 ] Viacheslav Tradunsky commented on SPARK-29765: -- [~shahid] Reproduced. An interesting thing is that despite the stage was marked as completed, on UI it showed tasks 106632/109889 (3 running). > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968703#comment-16968703 ] Nasir Ali commented on SPARK-28502: --- [~bryanc] If I perform any agg (e.g., avg) on grouped data, pyspark returns me the key (e.g., window etc.) with the avg for each row. However, in the above example, when udf returns the struct, it does not automatically return the key. I have to manually add window to returning dataframe. Is there a way to automatically concatenate results of udf with key? > Error with struct conversion while using pandas_udf > --- > > Key: SPARK-28502 > URL: https://issues.apache.org/jira/browse/SPARK-28502 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.3 > Environment: OS: Ubuntu > Python: 3.6 >Reporter: Nasir Ali >Priority: Minor > Fix For: 3.0.0 > > > What I am trying to do: Group data based on time intervals (e.g., 15 days > window) and perform some operations on dataframe using (pandas) UDFs. I don't > know if there is a better/cleaner way to do it. > Below is the sample code that I tried and error message I am getting. > > {code:java} > df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"), > (13.00, "2018-03-11T12:27:18+00:00"), > (25.00, "2018-03-12T11:27:18+00:00"), > (20.00, "2018-03-13T15:27:18+00:00"), > (17.00, "2018-03-14T12:27:18+00:00"), > (99.00, "2018-03-15T11:27:18+00:00"), > (156.00, "2018-03-22T11:27:18+00:00"), > (17.00, "2018-03-31T11:27:18+00:00"), > (25.00, "2018-03-15T11:27:18+00:00"), > (25.00, "2018-03-16T11:27:18+00:00") > ], >["id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > schema = StructType([ > StructField("id", IntegerType()), > StructField("ts", TimestampType()) > ]) > @pandas_udf(schema, PandasUDFType.GROUPED_MAP) > def some_udf(df): > # some computation > return df > df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show() > {code} > This throws following exception: > {code:java} > TypeError: Unsupported type in conversion from Arrow: struct timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]> > {code} > > However, if I use builtin agg method then it works all fine. For example, > {code:java} > df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False) > {code} > Output > {code:java} > +-+--+---+ > |id |window|avg(id)| > +-+--+---+ > |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0 | > |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0 | > |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0 | > |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0 | > |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0 | > |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0 | > |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0 | > +-+--+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968657#comment-16968657 ] L. C. Hsieh commented on SPARK-17495: - This looks like it should be resolved now? The last comments from [~tejasp] are about time-related datatypes. It seems they were added a long time ago, and there are no TODOs in the code. > Hive hash implementation > > > Key: SPARK-17495 > URL: https://issues.apache.org/jira/browse/SPARK-17495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Labels: bulk-closed > > Spark internally uses Murmur3Hash for partitioning. This is different from > the one used by Hive. For queries which use bucketing, this leads to different > results if one tries the same query on both engines. For us, we want users to > have backward compatibility so that one can switch parts of applications > across the engines without observing regressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29778) saveAsTable append mode is not passing writer options
Burak Yavuz created SPARK-29778: --- Summary: saveAsTable append mode is not passing writer options Key: SPARK-29778 URL: https://issues.apache.org/jira/browse/SPARK-29778 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Burak Yavuz There was an oversight where AppendData is not getting the WriterOptions in saveAsTable. [https://github.com/apache/spark/blob/782992c7ed652400e33bc4b1da04c8155b7b3866/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L530] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
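For context, a hedged sketch of the call pattern the ticket describes; the option name and table identifier below are hypothetical, not taken from the report. The point is only that, per the description, options attached to the writer fail to reach the AppendData plan on the append path.

{code:python}
# Hypothetical illustration of the affected pattern (names are made up):
# per the ticket, options set here are not forwarded to AppendData when
# saveAsTable takes the append path for a v2 table.
(df.write
   .option("someWriterOption", "some-value")            # hypothetical option
   .mode("append")
   .saveAsTable("some_catalog.some_db.target_table"))   # hypothetical table
{code}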
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968652#comment-16968652 ] shahid commented on SPARK-29765: I still not sure about the root cause, as I am not able to reproduce with small data. From the number I can see that it is something related to cleaning up the store, when the number of tasks exceeds the threshold. If you can still reproduce with the same data even after increasing the threshold, then it might be due to some other issue. > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968649#comment-16968649 ] Viacheslav Tradunsky commented on SPARK-29765: -- [~shahid] sure, in case we have more than 200 000 tasks shall we set this to our maximum? Just trying to understand how GC of objects could influence access in immutable collection. Do you allow somehow the elements to be collected when they are referenced by that vector? > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Created] (SPARK-29777) SparkR::cleanClosure aggressively removes a function required by user function
Hossein Falaki created SPARK-29777: -- Summary: SparkR::cleanClosure aggressively removes a function required by user function Key: SPARK-29777 URL: https://issues.apache.org/jira/browse/SPARK-29777 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.4 Reporter: Hossein Falaki Following code block reproduces the issue: {code} library(SparkR) spark_df <- createDataFrame(na.omit(airquality)) cody_local2 <- function(param2) { 10 + param2 } cody_local1 <- function(param1) { cody_local2(param1) } result <- cody_local2(5) calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) cody_local1(5) }) print(result) {code} We get following error message: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 27, 10.0.174.239, executor 0): org.apache.spark.SparkException: R computation failed with Error in cody_local2(param1) : could not find function "cody_local2" Calls: compute -> computeFunc -> cody_local1 {code} Compare that to this code block with succeeds: {code} calc_df <- dapplyCollect(spark_df, function(x) { cody_local2(20) #cody_local1(5) }) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968642#comment-16968642 ] shahid edited comment on SPARK-29765 at 11/6/19 7:39 PM: - Just to narrow down the problem, can you please increase `spark.ui.retainedTasks` from 10 to 20 and check if the issue still exist ? Thanks was (Author: shahid): Just to narrow down the problem, can you please increase `spark.ui.retainedTasks` from 10 to 20 ? Thanks > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at >
[jira] [Commented] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage
[ https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968642#comment-16968642 ] shahid commented on SPARK-29765: Just to narrow down the problem, can you please increase `spark.ui.retainedTasks` from 10 to 20 ? Thanks > Monitoring UI throws IndexOutOfBoundsException when accessing metrics of > attempt in stage > - > > Key: SPARK-29765 > URL: https://issues.apache.org/jira/browse/SPARK-29765 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: Amazon EMR 5.27 >Reporter: Viacheslav Tradunsky >Priority: Major > > When clicking on one of the largest tasks by input, I get to > [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0] > with 500 error > {code:java} > java.lang.IndexOutOfBoundsException: 95745 at > scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at > scala.collection.immutable.Vector.apply(Vector.scala:122) at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at > scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at > org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) > at > org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748){code} -- This message was
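The retained-tasks threshold mentioned in this exchange is likewise an ordinary Spark configuration. The sketch below is only an illustration; the value shown is an arbitrary example rather than a figure taken from the thread.

{code:python}
# Hedged sketch: raise the number of task entries the UI/status store retains
# per stage. 200000 is an arbitrary example value (the default is 100000).
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.ui.retainedTasks", "200000")
sc = SparkContext(conf=conf)
{code}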
[jira] [Assigned] (SPARK-29642) ContinuousMemoryStream throws error on String type
[ https://issues.apache.org/jira/browse/SPARK-29642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29642: -- Assignee: Jungtaek Lim > ContinuousMemoryStream throws error on String type > -- > > Key: SPARK-29642 > URL: https://issues.apache.org/jira/browse/SPARK-29642 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > While we can set String as a generic type of ContinuousMemoryStream, it > doesn't really work, because it doesn't convert String to UTF8String, and > accessing it from the Row interface throws an error. > We should encode the input and convert it to Row properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29642) ContinuousMemoryStream throws error on String type
[ https://issues.apache.org/jira/browse/SPARK-29642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29642. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26300 [https://github.com/apache/spark/pull/26300] > ContinuousMemoryStream throws error on String type > -- > > Key: SPARK-29642 > URL: https://issues.apache.org/jira/browse/SPARK-29642 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > While we can set String as a generic type of ContinuousMemoryStream, it > doesn't work really because it doesn't convert String to UTFString and > accessing it from Row interface would throw error. > We should encode the input and convert the input to Row properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
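For context on the fix described above: Spark's internal rows store string data as UTF8String (the "UTFString" mentioned in the description) rather than java.lang.String. A minimal sketch of the conversion involved, illustrative only and not the code of the linked pull request:
{code:scala}
import org.apache.spark.unsafe.types.UTF8String

// Internal rows (InternalRow / UnsafeRow) expect UTF8String, so a plain
// java.lang.String supplied by the caller has to be encoded first.
val external: String = "hello"
val internal: UTF8String = UTF8String.fromString(external)

// Going back to an external value is the reverse conversion.
val roundTripped: String = internal.toString
{code}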
[jira] [Closed] (SPARK-29571) Fix UT in AllExecutionsPageSuite class
[ https://issues.apache.org/jira/browse/SPARK-29571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Raj Boudh closed SPARK-29571. --- The issue has been fixed. > Fix UT in AllExecutionsPageSuite class > --- > > Key: SPARK-29571 > URL: https://issues.apache.org/jira/browse/SPARK-29571 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Ankit Raj Boudh >Assignee: Ankit Raj Boudh >Priority: Minor > Fix For: 3.0.0 > > > The "sorting should be successful" UT in the AllExecutionsPageSuite class is failing due > to an invalid assert condition. This needs to be handled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29752) make AdaptiveQueryExecSuite more robust
[ https://issues.apache.org/jira/browse/SPARK-29752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-29752. - Fix Version/s: 3.0.0 Resolution: Fixed > make AdaptiveQueryExecSuite more robust > --- > > Key: SPARK-29752 > URL: https://issues.apache.org/jira/browse/SPARK-29752 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29603) Support application priority for spark on yarn
[ https://issues.apache.org/jira/browse/SPARK-29603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29603: -- Assignee: Kent Yao > Support application priority for spark on yarn > -- > > Key: SPARK-29603 > URL: https://issues.apache.org/jira/browse/SPARK-29603 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > We can set priority to an application for YARN to define pending applications > ordering policy, those with higher priority have a better opportunity to be > activated. YARN CapacityScheduler only. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29603) Support application priority for spark on yarn
[ https://issues.apache.org/jira/browse/SPARK-29603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29603. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26255 [https://github.com/apache/spark/pull/26255] > Support application priority for spark on yarn > -- > > Key: SPARK-29603 > URL: https://issues.apache.org/jira/browse/SPARK-29603 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > We can set priority to an application for YARN to define pending applications > ordering policy, those with higher priority have a better opportunity to be > activated. YARN CapacityScheduler only. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
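As a hedged sketch of how the feature is typically used (assuming the configuration key introduced by the pull request above is spark.yarn.priority; check the merged documentation for the authoritative name):
{code:scala}
import org.apache.spark.SparkConf

// Assumption: the YARN application priority is read from "spark.yarn.priority".
// It only takes effect with the YARN CapacityScheduler and must be set before
// the application is submitted.
val conf = new SparkConf()
  .setAppName("high-priority-job")
  .set("spark.yarn.priority", "10")
{code}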
[jira] [Updated] (SPARK-29588) Improvements in WebUI JDBC/ODBC server page
[ https://issues.apache.org/jira/browse/SPARK-29588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shahid updated SPARK-29588: --- Priority: Major (was: Minor) > Improvements in WebUI JDBC/ODBC server page > --- > > Key: SPARK-29588 > URL: https://issues.apache.org/jira/browse/SPARK-29588 > Project: Spark > Issue Type: Umbrella > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 >Reporter: shahid >Priority: Major > > Improvements in JDBC/ODBC server page -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25466) Documentation does not specify how to set Kafka consumer cache capacity for SS
[ https://issues.apache.org/jira/browse/SPARK-25466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968435#comment-16968435 ] Shyam commented on SPARK-25466: --- [~gsomogyi] I am still facing the same issue how to fix it ? I tried these things mentioned here in SOF [https://stackoverflow.com/questions/58456939/how-to-set-spark-consumer-cache-to-fix-kafkaconsumer-cache-hitting-max-capaci] > Documentation does not specify how to set Kafka consumer cache capacity for SS > -- > > Key: SPARK-25466 > URL: https://issues.apache.org/jira/browse/SPARK-25466 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Affects Versions: 2.3.0 >Reporter: Patrick McGloin >Priority: Minor > > When hitting this warning with SS: > 19-09-2018 12:05:27 WARN CachedKafkaConsumer:66 - KafkaConsumer cache > hitting max capacity of 64, removing consumer for > CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30) > If you Google you get to this page: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > Which is for Spark Streaming and says to use this config item to adjust the > capacity: "spark.streaming.kafka.consumer.cache.maxCapacity". > This is a bit confusing as SS uses a different config item: > "spark.sql.kafkaConsumerCache.capacity" > Perhaps the SS Kafka documentation should talk about the consumer cache > capacity? Perhaps here? > https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html > Or perhaps the warning message should reference the config item. E.g > 19-09-2018 12:05:27 WARN CachedKafkaConsumer:66 - KafkaConsumer cache > hitting max capacity of 64, removing consumer for > CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30). > *The cache size can be adjusted with the setting > "spark.sql.kafkaConsumerCache.capacity".* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
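A minimal sketch of the Structured Streaming setting discussed above (the key is the one named in the description; the broker, topic, and capacity values are placeholders):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-consumer-cache-example")
  // Structured Streaming reads this key, not the DStream-era
  // spark.streaming.kafka.consumer.cache.maxCapacity.
  .config("spark.sql.kafkaConsumerCache.capacity", "128")
  .getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
  .option("subscribe", "MyKafkaTopic")
  .load()
{code}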
[jira] [Resolved] (SPARK-28725) Spark ML not able to de-serialize Logistic Regression model saved in previous version of Spark
[ https://issues.apache.org/jira/browse/SPARK-28725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28725. -- Resolution: Invalid > Spark ML not able to de-serialize Logistic Regression model saved in previous > version of Spark > -- > > Key: SPARK-28725 > URL: https://issues.apache.org/jira/browse/SPARK-28725 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 > Environment: PROD >Reporter: Sharad Varshney >Priority: Major > > Logistic Regression model saved using Spark version 2.3.0 in HDI is not > correctly de-serialized with Spark 2.4.3 version. It loads into the memory > but probabilities it emits on inference is like 1.45 e-44(to 44th decimal > place approx equal to 0) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29325) approxQuantile() results are incorrect and vary significantly for small changes in relativeError
[ https://issues.apache.org/jira/browse/SPARK-29325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29325. -- Resolution: Duplicate > approxQuantile() results are incorrect and vary significantly for small > changes in relativeError > > > Key: SPARK-29325 > URL: https://issues.apache.org/jira/browse/SPARK-29325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.4 > Environment: I was using OSX 10.14.6. > I was using Scala 2.11.12 and Spark 2.4.4. > I also verified the bug exists for Scala 2.11.8 and Spark 2.3.2. >Reporter: James Verbus >Priority: Major > Labels: correctness > Attachments: 20191001_example_data_approx_quantile_bug.zip > > > The [approxQuantile() > method|https://github.com/apache/spark/blob/3b1674cb1f244598463e879477d89632b0817578/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L40] > returns sometimes incorrect results that are sensitively dependent upon the > choice of the relativeError. > Below is an example in the latest Spark version (2.4.4). You can see the > result varies significantly for modest changes in the specified relativeError > parameter. The result varies much more than would be expected based upon the > relativeError parameter. > > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_212) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val df = spark.read.format("csv").option("header", > "true").option("inferSchema", > "true").load("./20191001_example_data_approx_quantile_bug") > df: org.apache.spark.sql.DataFrame = [value: double] > scala> df.stat.approxQuantile("value", Array(0.9), 0) > res0: Array[Double] = Array(0.5929591082174609) > scala> df.stat.approxQuantile("value", Array(0.9), 0.001) > res1: Array[Double] = Array(0.67621027121925) > scala> df.stat.approxQuantile("value", Array(0.9), 0.002) > res2: Array[Double] = Array(0.5926195654486178) > scala> df.stat.approxQuantile("value", Array(0.9), 0.003) > res3: Array[Double] = Array(0.5924693999048418) > scala> df.stat.approxQuantile("value", Array(0.9), 0.004) > res4: Array[Double] = Array(0.67621027121925) > scala> df.stat.approxQuantile("value", Array(0.9), 0.005) > res5: Array[Double] = Array(0.5923925937051544) > {code} > I attached a zip file containing the data used for the above example > demonstrating the bug. > Also, the following demonstrates that there is data for intermediate quantile > values between the 0.5926195654486178 and 0.67621027121925 values observed > above. 
> {code:java} > scala> df.stat.approxQuantile("value", Array(0.9), 0.0) > res10: Array[Double] = Array(0.5929591082174609) > scala> df.stat.approxQuantile("value", Array(0.91), 0.0) > res11: Array[Double] = Array(0.5966354540849995) > scala> df.stat.approxQuantile("value", Array(0.92), 0.0) > res12: Array[Double] = Array(0.6015676591185595) > scala> df.stat.approxQuantile("value", Array(0.93), 0.0) > res13: Array[Double] = Array(0.6029240823799614) > scala> df.stat.approxQuantile("value", Array(0.94), 0.0) > res14: Array[Double] = Array(0.611764547134) > scala> df.stat.approxQuantile("value", Array(0.95), 0.0) > res15: Array[Double] = Array(0.6185162204274052) > scala> df.stat.approxQuantile("value", Array(0.96), 0.0) > res16: Array[Double] = Array(0.625983000807062) > scala> df.stat.approxQuantile("value", Array(0.97), 0.0) > res17: Array[Double] = Array(0.6306892943412258) > scala> df.stat.approxQuantile("value", Array(0.98), 0.0) > res18: Array[Double] = Array(0.6365567375994333) > scala> df.stat.approxQuantile("value", Array(0.99), 0.0) > res19: Array[Double] = Array(0.6554479197566019) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
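For readers reproducing the report above, a small sketch comparing approxQuantile against an exact quantile computed with the SQL percentile aggregate can help confirm that the approximation, not the data, is at fault. This is illustrative only and assumes the same single-column DataFrame df loaded from the attached CSV:
{code:scala}
import org.apache.spark.sql.functions.expr

// Approximate 90th percentile with a non-zero relative error.
val approx = df.stat.approxQuantile("value", Array(0.9), 0.001).head

// Exact 90th percentile via the SQL percentile aggregate (slower, since it
// buffers all values, but it gives a reference point for the comparison).
val exact = df.agg(expr("percentile(value, 0.9)")).head.getDouble(0)

println(s"approx = $approx, exact = $exact")
{code}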
[jira] [Resolved] (SPARK-25880) user set some hadoop configurations can not work
[ https://issues.apache.org/jira/browse/SPARK-25880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-25880. -- Resolution: Not A Problem > user set some hadoop configurations can not work > > > Key: SPARK-25880 > URL: https://issues.apache.org/jira/browse/SPARK-25880 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: guojh >Priority: Major > > When user set some hadoop configuration in spark-defaults.conf, for instance: > spark.hadoop.mapreduce.input.fileinputformat.split.maxsize 10 > and then user use the spark-sql and use set command to overwrite this > configuration, but it can not cover the value which set in the file of > spark-defaults.conf. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
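As a hedged illustration of the behaviour discussed above: a spark.hadoop.* key from spark-defaults.conf can still be changed at runtime by writing to the Hadoop configuration directly. This is a sketch of a workaround, not a statement about how the SET command should behave; the split size below is an example value.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Keys prefixed with spark.hadoop. in spark-defaults.conf are copied into this
// Hadoop Configuration when the context starts; mutating it afterwards affects
// inputs read from that point on.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.split.maxsize", "134217728")
{code}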
[jira] [Resolved] (SPARK-29453) Improve tooltip information for SQL tab
[ https://issues.apache.org/jira/browse/SPARK-29453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29453. -- Resolution: Not A Problem > Improve tooltip information for SQL tab > --- > > Key: SPARK-29453 > URL: https://issues.apache.org/jira/browse/SPARK-29453 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29585) Duration in stagePage does not match Duration in Summary Metrics for Completed Tasks
[ https://issues.apache.org/jira/browse/SPARK-29585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29585. -- Resolution: Not A Problem > Duration in stagePage does not match Duration in Summary Metrics for > Completed Tasks > > > Key: SPARK-29585 > URL: https://issues.apache.org/jira/browse/SPARK-29585 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: teeyog >Priority: Major > > Summary Metrics for Completed Tasks and Duration in Task details both use executorRunTime, > but Duration in Completed Stages uses > {code:java} > stageData.completionTime - stageData.firstTaskLaunchedTime{code} > , which produces different values. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29548) Redirect system print stream to log4j and improve robustness
[ https://issues.apache.org/jira/browse/SPARK-29548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29548. -- Resolution: Won't Fix > Redirect system print stream to log4j and improve robustness > > > Key: SPARK-29548 > URL: https://issues.apache.org/jira/browse/SPARK-29548 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Priority: Major > > In a production environment, user behavior is highly random and uncertain. > For example, users call `System.out` or `System.err` to print information. > These system print streams can cause trouble, such as the output file on disk > growing too large. > In my production environment this filled the disk and left the [NodeManager] unhealthy. > One option is to forbid the use of `System.out` or `System.err`, but that is > unfriendly to users. > A better option is to redirect the system print streams to `Log4j`, so that Spark > can take advantage of `Log4j`'s log splitting (rolling). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
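A minimal sketch of the redirection idea described in the ticket, showing one common way to do it rather than the implementation the reporter proposed: wrap a logger in an OutputStream and install it with System.setOut.
{code:scala}
import java.io.{OutputStream, PrintStream}
import org.apache.log4j.Logger

// Buffers bytes until a newline, then forwards the completed line to log4j.
// Simplified: treats each byte as a character, which is fine for ASCII output.
class LoggingOutputStream(logger: Logger) extends OutputStream {
  private val buffer = new StringBuilder
  override def write(b: Int): Unit = {
    if (b == '\n') { logger.info(buffer.toString); buffer.clear() }
    else buffer.append(b.toChar)
  }
}

// Redirect stdout so stray println calls end up in the (rolling) log files
// instead of an unbounded stdout file on the NodeManager.
System.setOut(new PrintStream(new LoggingOutputStream(Logger.getLogger("stdout")), true))
{code}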
[jira] [Assigned] (SPARK-29751) Scalers use Summarizer instead of MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-29751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29751: Assignee: zhengruifeng > Scalers use Summarizer instead of MultivariateOnlineSummarizer > -- > > Key: SPARK-29751 > URL: https://issues.apache.org/jira/browse/SPARK-29751 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > I did performance tests and found that using Summarizer instead of > MultivariateOnlineSummarizer will speed up the fitting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29751) Scalers use Summarizer instead of MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-29751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29751. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26393 [https://github.com/apache/spark/pull/26393] > Scalers use Summarizer instead of MultivariateOnlineSummarizer > -- > > Key: SPARK-29751 > URL: https://issues.apache.org/jira/browse/SPARK-29751 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > I did performance tests and found that using Summarizer instead of > MultivariateOnlineSummarizer will speed up the fitting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
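For context, a small sketch of the ml.stat.Summarizer API that the change above moves the scalers onto; this is illustrative usage, not the patched scaler code:
{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(
  (Vectors.dense(1.0, 2.0), 1.0),
  (Vectors.dense(3.0, 4.0), 1.0)
).toDF("features", "weight")

// Summarizer computes only the requested metrics in a single pass, which is
// what makes it cheaper than MultivariateOnlineSummarizer for the scalers.
val (mean, variance) = df
  .select(Summarizer.metrics("mean", "variance").summary($"features").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)]
  .first()
{code}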
[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:46 PM: --- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] . Please let me know if anything else is required. I am totally stuck with this issue. was (Author: felixkjose): [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at >
[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:45 PM: --- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] was (Author: felixkjose): [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at
[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-29764: - Attachment: SparkParquetSampleCode.docx > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800) > ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost > task 0.0 in stage 0.0 (TID 0, localhost, executor driver): >
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose commented on SPARK-29764: -- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. > *My Data Model:* > @Data > public class Employee > { private UUID id = UUID.randomUUID(); private String name; private int age; > private LocalDate dob; private LocalDateTime startDateTime; private String > phone; private Address address; } > > *Serialization Snippet* > {color:#0747a6}public void serialize(){color} > {color:#0747a6}{ List inputDataToSerialize = > getInputDataToSerialize(); // this creates 100,000 employee objects > Encoder employeeEncoder = Encoders.bean(Employee.class); > Dataset employeeDataset = sparkSession.createDataset( > inputDataToSerialize, employeeEncoder ); employeeDataset.write() > .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); > }{color} > +*Exception Stack Trace:* > + > *java.lang.IllegalStateException: Failed to execute > CommandLineRunnerjava.lang.IllegalStateException: Failed to execute > CommandLineRunner at > org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803) > at > org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784) > at > org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) > at > org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) > at com.felix.Application.main(Application.java:45)Caused by: > org.apache.spark.SparkException: Job aborted. 
at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at > com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at > com.felix.Application.run(Application.java:63) at >
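Not part of the original report: a commonly used workaround on Spark 2.4 is to model temporal fields with java.sql.Date / java.sql.Timestamp, which the built-in encoders understand. A hedged Scala sketch with a hypothetical EmployeeRow case class and example path; the Java bean in the report would need the equivalent change:
{code:scala}
import java.sql.{Date, Timestamp}
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical case class mirroring the Employee bean, with the java.time
// fields replaced by the SQL types that Spark 2.4 encoders support.
case class EmployeeRow(name: String, age: Int, dob: Date, startDateTime: Timestamp)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val rows = Seq(
  EmployeeRow("a", 30,
    Date.valueOf(LocalDate.of(1989, 1, 1)),
    Timestamp.valueOf(LocalDateTime.of(2019, 11, 6, 9, 0)))
)

spark.createDataset(rows)
  .write
  .mode(SaveMode.Append)
  .parquet("/tmp/employee.parquet") // example path
{code}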
[jira] [Commented] (SPARK-29776) rpad returning invalid value when parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968406#comment-16968406 ] Ankit raj boudh commented on SPARK-29776: - i will start checking this issue. > rpad returning invalid value when parameter is empty > > > Key: SPARK-29776 > URL: https://issues.apache.org/jira/browse/SPARK-29776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > As per rpad definition > rpad > rpad(str, len, pad) - Returns str, right-padded with pad to a length of len > If str is longer than len, the return value is shortened to len characters. > *In case of empty pad string, the return value is null.* > Below is Example > In Spark: > 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, ''); > ++ > | rpad(hi, 5, ) | > ++ > | hi | > ++ > It should return NULL as per definition. > > Hive behavior is correct as per definition it returns NULL when pad is empty > String > INFO : Concurrency mode is disabled, not creating a lock manager > +---+ > | _c0 | > +---+ > | NULL | > +---+ > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29776) rpad returning invalid value when parameter is empty
ABHISHEK KUMAR GUPTA created SPARK-29776: Summary: rpad returning invalid value when parameter is empty Key: SPARK-29776 URL: https://issues.apache.org/jira/browse/SPARK-29776 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA As per the rpad definition: rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *In case of empty pad string, the return value is null.* Below is an example. In Spark:
0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
+----------------+
| rpad(hi, 5, )  |
+----------------+
| hi             |
+----------------+
It should return NULL as per the definition. Hive behavior is correct: as per the definition it returns NULL when pad is an empty string.
INFO : Concurrency mode is disabled, not creating a lock manager
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
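To make the reported behaviour easy to reproduce from code, a small sketch (illustrative only; per the definition quoted above, the expected result would be NULL):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Spark currently returns "hi" here; per the rpad definition quoted in the
// ticket, an empty pad string should yield NULL (which is what Hive returns).
spark.sql("SELECT rpad('hi', 5, '')").show()
{code}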
[jira] [Commented] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968403#comment-16968403 ] Sean R. Owen commented on SPARK-29760: -- That's standard SQL, no? and it works. I don't know if it needs to be documented specifically. Where would you mention it? > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
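For the documentation discussion above, a couple of additional VALUES forms that already work in Spark and might be worth mentioning; this is a sketch, not a proposed documentation page:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// VALUES as a standalone statement.
spark.sql("VALUES (1, 'one'), (2, 'two'), (3, 'three')").show()

// VALUES as an inline table in a FROM clause, with column aliases.
spark.sql("SELECT id, name FROM VALUES (1, 'one'), (2, 'two') AS t(id, name)").show()
{code}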
[jira] [Created] (SPARK-29775) Support truncate multiple tables
jobit mathew created SPARK-29775: Summary: Support truncate multiple tables Key: SPARK-29775 URL: https://issues.apache.org/jira/browse/SPARK-29775 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.4 Reporter: jobit mathew Spark SQL supports truncating a single table, like TRUNCATE TABLE t1; but PostgreSQL supports truncating multiple tables in one statement, like TRUNCATE bigtable, fattable; so Spark could also support truncating multiple tables. [https://www.postgresql.org/docs/12/sql-truncate.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
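A short sketch contrasting what Spark SQL accepts today with the requested syntax; the multi-table form is the feature being proposed, so it is shown only as a commented-out target:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Works today: truncating a single table.
spark.sql("TRUNCATE TABLE t1")

// Requested here (PostgreSQL-style), not yet supported by Spark's parser:
// spark.sql("TRUNCATE TABLE t1, t2")
{code}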
[jira] [Comment Edited] (SPARK-29760) Document VALUES statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967500#comment-16967500 ] jobit mathew edited comment on SPARK-29760 at 11/6/19 1:43 PM: --- [~LI,Xiao] and [~dkbiswal] could you please confirm this. [~srowen] any comment on this? was (Author: jobitmathew): [~LI,Xiao] and [~dkbiswal] could you please confirm this > Document VALUES statement in SQL Reference. > --- > > Key: SPARK-29760 > URL: https://issues.apache.org/jira/browse/SPARK-29760 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: jobit mathew >Priority: Minor > > spark-sql also supports *VALUES *. > {code:java} > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three'); > 1 one > 2 two > 3 three > Time taken: 0.015 seconds, Fetched 3 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2; > 1 one > 2 two > Time taken: 0.014 seconds, Fetched 2 row(s) > spark-sql> > spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2; > 1 one > 3 three > 2 two > Time taken: 0.153 seconds, Fetched 3 row(s) > spark-sql> > {code} > or even *values *can be used along with INSERT INTO or select. > refer: https://www.postgresql.org/docs/current/sql-values.html > So please confirm VALUES also can be documented or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29746) implement validateInputType in Normalizer
[ https://issues.apache.org/jira/browse/SPARK-29746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29746. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26388 [https://github.com/apache/spark/pull/26388] > implement validateInputType in Normalizer > - > > Key: SPARK-29746 > URL: https://issues.apache.org/jira/browse/SPARK-29746 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > UnaryTransformer.transformSchema calls validateInputType method, but this > method is not implemented in ElementwiseProduct, Normalizer and > PolynomialExpansion. Will implement validateInput in ElementwiseProduct, > Normalizer and PolynomialExpansion. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29746) implement validateInputType in Normalizer
[ https://issues.apache.org/jira/browse/SPARK-29746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29746: Assignee: Huaxin Gao > implement validateInputType in Normalizer > - > > Key: SPARK-29746 > URL: https://issues.apache.org/jira/browse/SPARK-29746 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > UnaryTransformer.transformSchema calls validateInputType method, but this > method is not implemented in ElementwiseProduct, Normalizer and > PolynomialExpansion. Will implement validateInput in ElementwiseProduct, > Normalizer and PolynomialExpansion. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27763) Port test cases from PostgreSQL to Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-27763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968346#comment-16968346 ] Takeshi Yamamuro commented on SPARK-27763: -- I've finished checking `create_table.sql` and `alter_table.sql` query-by-query and I think most parts of the queries in them are less meaning for Spark and the most issues there already have been filed in SPARK-27764. So, I think we could close this as resolved for now. WDYT? - create_table.sql: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/create_table.sql - alter_table.sql: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/alter_table.sql > Port test cases from PostgreSQL to Spark SQL > > > Key: SPARK-27763 > URL: https://issues.apache.org/jira/browse/SPARK-27763 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > To improve the test coverage, we can port the regression tests from the other > popular open source projects to Spark SQL. PostgreSQL is one of the best SQL > systems. Below are the links to the test cases and results. > * Regression test cases: > [https://github.com/postgres/postgres/tree/master/src/test/regress/sql] > * Expected results: > [https://github.com/postgres/postgres/tree/master/src/test/regress/expected] > Spark SQL does not support all the feature sets of PostgreSQL. In the current > stage, we should first comment out these test cases and create the > corresponding JIRAs in SPARK-27764. We can discuss and prioritize which > features we should support. Also, these PostgreSQL regression tests could > also expose the existing bugs of Spark SQL. We should also create the JIRAs > and track them in SPARK-27764. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28296) Improved VALUES support
[ https://issues.apache.org/jira/browse/SPARK-28296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968332#comment-16968332 ] jobit mathew commented on SPARK-28296: -- [~petertoth] there are some more valid commands supporting in postgresql, but which are not supporting in spark sql. VALUES (1, 'one'), (2, 'two'), (3, 'three') FETCH FIRST 3 rows only; VALUES (1, 'one'), (2, 'two'), (3, 'three') OFFSET 1 row; || ||column1||column2|| |1|1|one| |2|2|two| |3|3|three| || ||column1||column2|| |1|2|two| |2|3|three| but same in spark-sql spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') FETCH FIRST 3 rows only; Error in query: mismatched input 'FIRST' expecting (line 1, pos 50) == SQL == VALUES (1, 'one'), (2, 'two'), (3, 'three') FETCH FIRST 3 rows only --^^^ spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') OFFSET 1 row; Error in query: mismatched input '1' expecting (line 1, pos 51) == SQL == VALUES (1, 'one'), (2, 'two'), (3, 'three') OFFSET 1 row ---^^^ spark-sql> > Improved VALUES support > --- > > Key: SPARK-28296 > URL: https://issues.apache.org/jira/browse/SPARK-28296 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > These are valid queries in PostgreSQL, but they don't work in Spark SQL: > {noformat} > values ((select 1)); > values ((select c from test1)); > select (values(c)) from test10; > with cte(foo) as ( values(42) ) values((select foo from cte)); > {noformat} > where test1 and test10: > {noformat} > CREATE TABLE test1 (c INTEGER); > INSERT INTO test1 VALUES(1); > CREATE TABLE test10 (c INTEGER); > INSERT INTO test10 SELECT generate_sequence(1, 10); > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29773) Unable to process empty ORC files in Hive Table using Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-29773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968329#comment-16968329 ] Alexander Ermakov commented on SPARK-29773: --- We will try on 2.4.4 > Unable to process empty ORC files in Hive Table using Spark SQL > --- > > Key: SPARK-29773 > URL: https://issues.apache.org/jira/browse/SPARK-29773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: Centos 7, Spark 2.3.1, Hive 2.3.0 >Reporter: Alexander Ermakov >Priority: Major > > Unable to process empty ORC files in Hive Table using Spark SQL. It seems > that a problem with class > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits() > Stack trace: > {code:java} > 19/10/30 22:29:54 ERROR SparkSQLDriver: Failed in [select distinct > _tech_load_dt from dl_raw.tpaccsieee_ut_data_address] > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange hashpartitioning(_tech_load_dt#1374, 200) > +- *(1) HashAggregate(keys=[_tech_load_dt#1374], functions=[], > output=[_tech_load_dt#1374]) >+- HiveTableScan [_tech_load_dt#1374], HiveTableRelation > `dl_raw`.`tpaccsieee_ut_data_address`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, [address#1307, address_9zp#1308, > address_adm#1309, address_md#1310, adress_doc#1311, building#1312, > change_date_addr_el#1313, change_date_okato#1314, change_date_окато#1315, > city#1316, city_id#1317, cnv_cont_id#1318, code_intercity#1319, > code_kladr#1320, code_plan1#1321, date_act#1322, date_change#1323, > date_prz_incorrect_code_kladr#1324, date_record#1325, district#1326, > district_id#1327, etaj#1328, e_plan#1329, fax#1330, ... 44 more fields] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:150) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:324) > at > 
org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:122) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver$$anonfun$run$1.apply(SparkSQLDriver.scala:64) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > at > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:272) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
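Not from the original thread: when Hive-written ORC splits misbehave, one knob that is sometimes worth trying is letting Spark's native ORC reader handle the metastore table instead of the Hive SerDe path. A hedged sketch only; whether it avoids this particular OrcInputFormat.getSplits() failure is unverified.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-empty-files-example")
  .enableHiveSupport()
  // Route ORC-backed Hive tables through Spark's native ORC data source
  // instead of org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

spark.sql("SELECT DISTINCT _tech_load_dt FROM dl_raw.tpaccsieee_ut_data_address").show()
{code}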
[jira] [Updated] (SPARK-16872) Impl Gaussian Naive Bayes Classifier
[ https://issues.apache.org/jira/browse/SPARK-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-16872: - Summary: Impl Gaussian Naive Bayes Classifier (was: Include Gaussian Naive Bayes Classifier) > Impl Gaussian Naive Bayes Classifier > > > Key: SPARK-16872 > URL: https://issues.apache.org/jira/browse/SPARK-16872 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > I implemented Gaussian NB according to scikit-learn's {{GaussianNB}}. > In GaussianNB model, the {{theta}} matrix is used to store means and there is > a extra {{sigma}} matrix storing the variance of each feature. > GaussianNB in spark > {code} > scala> import org.apache.spark.ml.classification.GaussianNaiveBayes > import org.apache.spark.ml.classification.GaussianNaiveBayes > scala> val path = > "/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > path: String = > /Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt > scala> val data = spark.read.format("libsvm").load(path).persist() > data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: > double, features: vector] > scala> val gnb = new GaussianNaiveBayes() > gnb: org.apache.spark.ml.classification.GaussianNaiveBayes = gnb_54c50467306c > scala> val model = gnb.fit(data) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training: numPartitions=1 > storageLevel=StorageLevel(1 replicas) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numFeatures":4} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numClasses":3} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training finished > model: org.apache.spark.ml.classification.GaussianNaiveBayesModel = > GaussianNaiveBayesModel (uid=gnb_54c50467306c) with 3 classes > scala> model.pi > res0: org.apache.spark.ml.linalg.Vector = > [-1.0986122886681098,-1.0986122886681098,-1.0986122886681098] > scala> model.pi.toArray.map(math.exp) > res1: Array[Double] = Array(0., 0., > 0.) 
> scala> model.theta > res2: org.apache.spark.ml.linalg.Matrix = > 0.270067018001 -0.188540006 0.543050720001 0.60546 > -0.60779998 0.18172 -0.842711740006 > -0.88139998 > -0.091425964 -0.35858001 0.105084738 > 0.021666701507102017 > scala> model.sigma > res3: org.apache.spark.ml.linalg.Matrix = > 0.1223012510889361 0.07078051983960698 0.0343595243976 > 0.051336071297393815 > 0.03758145300924998 0.09880280046403413 0.003390296940069426 > 0.007822241779598893 > 0.08058763609659315 0.06701386661293329 0.024866409227781675 > 0.02661391644759426 > scala> model.transform(data).select("probability").take(10) > [rdd_68_0] > res4: Array[org.apache.spark.sql.Row] = > Array([[1.0627410543476422E-21,0.9938,6.2765233965353945E-15]], > [[7.254521422345374E-26,1.0,1.3849442153180895E-18]], > [[1.9629244119173135E-24,0.9998,1.9424765181237926E-16]], > [[6.061218297948492E-22,0.9902,9.853216073401884E-15]], > [[0.9972225671942837,8.844241161578932E-165,0.002777432805716399]], > [[5.361683970373604E-26,1.0,2.3004604508982183E-18]], > [[0.01062850630038623,3.3102617689978775E-100,0.9893714936996136]], > [[1.9297314618271785E-4,2.124922209137708E-71,0.9998070268538172]], > [[3.118816393732361E-27,1.0,6.5310299615983584E-21]], > [[0.926009854522,8.734773657627494E-206,7.399014547943611E-6]]) > scala> model.transform(data).select("prediction").take(10) > [rdd_68_0] > res5: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], > [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]) > {code} > GaussianNB in scikit-learn > {code} > import numpy as np > from sklearn.naive_bayes import GaussianNB > from sklearn.datasets import load_svmlight_file > path = > '/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt' > X, y = load_svmlight_file(path) > X = X.toarray() > clf = GaussianNB() > clf.fit(X, y) > >>> clf.class_prior_ > array([ 0., 0., 0.]) > >>> clf.theta_ > array([[ 0.2701, -0.1885, 0.54305072, 0.6055], >[-0.6078, 0.1817, -0.84271174, -0.8814], >[-0.0914, -0.3586, 0.10508474, 0.0216667 ]]) > > >>> clf.sigma_ > array([[ 0.12230125, 0.07078052, 0.03430001, 0.05133607],