[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012562#comment-17012562 ] Jungtaek Lim commented on SPARK-28594: -- I'm enumerating the items that are "good to do"; it might be better to file JIRA issues for them once we decide we should do them, or once all required functionality is done and we have the resources to deal with them. For now, the items I have are below: * Retain a specific number of jobs / executions, which allows the compact file to keep some finished jobs / executions ** [https://github.com/apache/spark/pull/27085#discussion_r363428336] * Separate compaction from cleaning to allow leaving some old event log files after compaction ** [https://github.com/apache/spark/pull/27085#issuecomment-572792067] * Cache the state of the compactor to avoid replaying event log files that were already loaded ** [https://github.com/apache/spark/pull/26416#discussion_r358260674] > Allow event logs for running streaming apps to be rolled over. > -- > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: This has been reported on 2.0.2.22 but affects all > currently available versions. >Reporter: Stephen Levett >Priority: Major > > In all current Spark releases, when event logging is enabled for Spark Streaming, the > event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > addresses .inprogress files but not event log files of applications that are still running. > Can we identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012575#comment-17012575 ] Dongjoon Hyun commented on SPARK-29988: --- Oops. [~shaneknapp]. I forgot that we need the following two. - `spark-master-test-maven-hadoop-2.7-hive-2.3` - `spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11` I guess we don't need to add SBT build (`spark-master-test-sbt-hadoop-2.7-hive-1.2`). cc [~smilegator], [~yumwang], [~srowen]. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > Now now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012575#comment-17012575 ] Dongjoon Hyun edited comment on SPARK-29988 at 1/10/20 8:43 AM: Oops. [~shaneknapp]. I forgot that we need the following two. - `spark-master-test-maven-hadoop-2.7-hive-2.3` - `spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11` As I described in the JIRA description, SPARK-29981 is resolved. So, we need the above. Since we have too many jobs already, I guess we don't need to add SBT build (`spark-master-test-sbt-hadoop-2.7-hive-1.2`) instead. cc [~smilegator], [~yumwang], [~srowen]. was (Author: dongjoon): Oops. [~shaneknapp]. I forgot that we need the following two. - `spark-master-test-maven-hadoop-2.7-hive-2.3` - `spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11` I guess we don't need to add SBT build (`spark-master-test-sbt-hadoop-2.7-hive-1.2`). cc [~smilegator], [~yumwang], [~srowen]. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > Now now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012575#comment-17012575 ] Dongjoon Hyun edited comment on SPARK-29988 at 1/10/20 8:44 AM: Oops. [~shaneknapp]. I forgot that we need the following two. - `spark-master-test-maven-hadoop-2.7-hive-2.3` - `spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11` Since we have too many jobs already, I guess we don't need to add SBT build (`spark-master-test-sbt-hadoop-2.7-hive-1.2`) instead. cc [~smilegator], [~yumwang], [~srowen]. was (Author: dongjoon): Oops. [~shaneknapp]. I forgot that we need the following two. - `spark-master-test-maven-hadoop-2.7-hive-2.3` - `spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11` As I described in the JIRA description, SPARK-29981 is resolved. So, we need the above. Since we have too many jobs already, I guess we don't need to add SBT build (`spark-master-test-sbt-hadoop-2.7-hive-1.2`) instead. cc [~smilegator], [~yumwang], [~srowen]. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > Now now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30482) Add sub-class of AppenderSkeleton reusable in tests
Maxim Gekk created SPARK-30482: -- Summary: Add sub-class of AppenderSkeleton reusable in tests Key: SPARK-30482 URL: https://issues.apache.org/jira/browse/SPARK-30482 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 2.4.4 Reporter: Maxim Gekk Some tests define a similar sub-class of AppenderSkeleton. The code duplication can be eliminated by defining a common class in [SparkFunSuite.scala|https://github.com/apache/spark/compare/master...MaxGekk:dedup-appender-skeleton?expand=1#diff-d521001af1af1a2aace870feb25ae0b0] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30482) Add sub-class of AppenderSkeleton reusable in tests
[ https://issues.apache.org/jira/browse/SPARK-30482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30482: --- Component/s: (was: SQL) > Add sub-class of AppenderSkeleton reusable in tests > --- > > Key: SPARK-30482 > URL: https://issues.apache.org/jira/browse/SPARK-30482 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > > Some tests define a similar sub-class of AppenderSkeleton. The code duplication > can be eliminated by defining a common class in > [SparkFunSuite.scala|https://github.com/apache/spark/compare/master...MaxGekk:dedup-appender-skeleton?expand=1#diff-d521001af1af1a2aace870feb25ae0b0] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
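As a rough illustration of the shared helper this ticket proposes, here is a minimal sketch of a log4j 1.x AppenderSkeleton subclass that buffers logging events so tests can assert on them. The class and member names are assumptions for illustration, not necessarily the ones used in the linked diff.
{code:scala}
import scala.collection.mutable.ArrayBuffer

import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent

// Hypothetical shared test appender: collects every event it receives.
class LogAppender extends AppenderSkeleton {
  val loggingEvents = new ArrayBuffer[LoggingEvent]()

  override def append(event: LoggingEvent): Unit = loggingEvents.append(event)
  override def close(): Unit = {}
  override def requiresLayout(): Boolean = false
}
{code}
A test would attach an instance via Logger.getRootLogger.addAppender(...), run the code under test, and then inspect loggingEvents.map(_.getRenderedMessage).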
[jira] [Resolved] (SPARK-30018) Support ALTER DATABASE SET OWNER syntax
[ https://issues.apache.org/jira/browse/SPARK-30018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30018. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26775 [https://github.com/apache/spark/pull/26775] > Support ALTER DATABASE SET OWNER syntax > --- > > Key: SPARK-30018 > URL: https://issues.apache.org/jira/browse/SPARK-30018 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role; > -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30018) Support ALTER DATABASE SET OWNER syntax
[ https://issues.apache.org/jira/browse/SPARK-30018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30018: --- Assignee: Kent Yao > Support ALTER DATABASE SET OWNER syntax > --- > > Key: SPARK-30018 > URL: https://issues.apache.org/jira/browse/SPARK-30018 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > {code:sql} > ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role; > -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
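For readers following along, a small hypothetical usage sketch of the syntax added here, assuming a running SparkSession named `spark`; the database and user names are placeholders:
{code:scala}
// `sales_db` and `alice` are made-up names for illustration.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
spark.sql("ALTER DATABASE sales_db SET OWNER USER alice")
// The new owner should be visible in the extended database metadata.
spark.sql("DESCRIBE DATABASE EXTENDED sales_db").show(truncate = false)
{code}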
[jira] [Commented] (SPARK-27148) Support CURRENT_TIME and LOCALTIME when ANSI mode enabled
[ https://issues.apache.org/jira/browse/SPARK-27148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012620#comment-17012620 ] pavithra ramachandran commented on SPARK-27148: --- [~maropu] I would like to work on this.. > Support CURRENT_TIME and LOCALTIME when ANSI mode enabled > - > > Key: SPARK-27148 > URL: https://issues.apache.org/jira/browse/SPARK-27148 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > CURRENT_TIME and LOCALTIME should be supported in the ANSI standard; > {code:java} > postgres=# select CURRENT_TIME; > timetz > > 16:45:43.398109+09 > (1 row) > postgres=# select LOCALTIME; > time > > 16:45:48.60969 > (1 row){code} > Before this, we need to support TIME types (java.sql.Time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30483) Job History does not show pool properties table
ABHISHEK KUMAR GUPTA created SPARK-30483: Summary: Job History does not show pool properties table Key: SPARK-30483 URL: https://issues.apache.org/jira/browse/SPARK-30483 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA The Stage page will show the Pool Name column, but when the user clicks the <Pool Name> hyperlink it will not redirect to the Pool Properties Table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30483) Job History does not show pool properties table
[ https://issues.apache.org/jira/browse/SPARK-30483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012654#comment-17012654 ] pavithra ramachandran commented on SPARK-30483: --- I shall work on this. > Job History does not show pool properties table > --- > > Key: SPARK-30483 > URL: https://issues.apache.org/jira/browse/SPARK-30483 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > The Stage page will show the Pool Name column, but when the user clicks the <Pool Name> hyperlink it will not redirect to the Pool Properties Table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30484) Job History Storage Tab does not display RDD Table
ABHISHEK KUMAR GUPTA created SPARK-30484: Summary: Job History Storage Tab does not display RDD Table Key: SPARK-30484 URL: https://issues.apache.org/jira/browse/SPARK-30484 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA scala> import org.apache.spark.storage.StorageLevel._ import org.apache.spark.storage.StorageLevel._ scala> val rdd = sc.range(0, 100, 1, 5).setName("rdd") rdd: org.apache.spark.rdd.RDD[Long] = rdd MapPartitionsRDD[1] at range at <console>:27 scala> rdd.persist(MEMORY_ONLY_SER) res0: rdd.type = rdd MapPartitionsRDD[1] at range at <console>:27 scala> rdd.count res1: Long = 100 scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name") df: org.apache.spark.sql.DataFrame = [count: int, name: string] scala> df.persist(DISK_ONLY) res2: df.type = [count: int, name: string] scala> df.count res3: Long = 3 Open the Storage tab under Incomplete Jobs in the Job History page. The UI will not display the RDD Table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30484) Job History Storage Tab does not display RDD Table
[ https://issues.apache.org/jira/browse/SPARK-30484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012656#comment-17012656 ] pavithra ramachandran commented on SPARK-30484: --- I shall work on this. > Job History Storage Tab does not display RDD Table > -- > > Key: SPARK-30484 > URL: https://issues.apache.org/jira/browse/SPARK-30484 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > scala> import org.apache.spark.storage.StorageLevel._ > import org.apache.spark.storage.StorageLevel._ > scala> val rdd = sc.range(0, 100, 1, 5).setName("rdd") > rdd: org.apache.spark.rdd.RDD[Long] = rdd MapPartitionsRDD[1] at range at > <console>:27 > scala> rdd.persist(MEMORY_ONLY_SER) > res0: rdd.type = rdd MapPartitionsRDD[1] at range at <console>:27 > scala> rdd.count > res1: Long = 100 > > scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", > "name") > df: org.apache.spark.sql.DataFrame = [count: int, name: string] > scala> df.persist(DISK_ONLY) > res2: df.type = [count: int, name: string] > scala> df.count > res3: Long = 3 > Open the Storage tab under Incomplete Jobs in the Job History page. > The UI will not display the RDD Table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30485) Remove SQL configs deprecated before v2.4
Maxim Gekk created SPARK-30485: -- Summary: Remove SQL configs deprecated before v2.4 Key: SPARK-30485 URL: https://issues.apache.org/jira/browse/SPARK-30485 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Remove the following SQL configs: * spark.sql.variable.substitute.depth * spark.sql.execution.pandas.respectSessionTimeZone * spark.sql.parquet.int64AsTimestampMillis * Maybe spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName which was deprecated in v2.4 Recently all deprecated SQL configs were gathered to the deprecatedSQLConfigs map: https://github.com/apache/spark/blob/1ffa627ffb93dc1027cb4b72f36ec9b7319f48e4/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2160-L2189 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30485) Remove SQL configs deprecated before v2.4
[ https://issues.apache.org/jira/browse/SPARK-30485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012678#comment-17012678 ] Maxim Gekk commented on SPARK-30485: [~dongjoon] [~srowen] [~cloud_fan] [~hyukjin.kwon] WDYT of removing them? > Remove SQL configs deprecated before v2.4 > - > > Key: SPARK-30485 > URL: https://issues.apache.org/jira/browse/SPARK-30485 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Remove the following SQL configs: > * spark.sql.variable.substitute.depth > * spark.sql.execution.pandas.respectSessionTimeZone > * spark.sql.parquet.int64AsTimestampMillis > * Maybe spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName > which was deprecated in v2.4 > Recently all deprecated SQL configs were gathered to the deprecatedSQLConfigs > map: > https://github.com/apache/spark/blob/1ffa627ffb93dc1027cb4b72f36ec9b7319f48e4/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2160-L2189 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
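As a purely illustrative aside (this is not the actual SQLConf code), the kind of registry the description points at can be sketched as deprecated keys mapped to a version and a comment, so that setting one logs a warning while outright removal turns it into an error. The version strings and comment text below are placeholders:
{code:scala}
// Illustrative only; field names, versions and comment text are assumptions.
case class DeprecatedConfig(key: String, version: String, comment: String)

val deprecatedSQLConfigs: Map[String, DeprecatedConfig] = Seq(
  DeprecatedConfig("spark.sql.variable.substitute.depth", "2.x", "placeholder comment"),
  DeprecatedConfig("spark.sql.execution.pandas.respectSessionTimeZone", "2.x", "placeholder comment"),
  DeprecatedConfig("spark.sql.parquet.int64AsTimestampMillis", "2.x", "placeholder comment")
).map(c => c.key -> c).toMap

def warnIfDeprecated(key: String): Unit =
  deprecatedSQLConfigs.get(key).foreach { c =>
    println(s"SQL config '${c.key}' was deprecated in Spark ${c.version}: ${c.comment}")
  }
{code}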
[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012717#comment-17012717 ] Gabor Somogyi commented on SPARK-30460: --- [~Sachin] Do I understand it correctly that you're using S3 as checkpoint location? If so then all I can say it's not working because S3 read-after-write consistency model. In Spark 3.0 there is a new output committer where the expectation is that it will work but not yet deeply tested... > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:
[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012728#comment-17012728 ] Sachin Pasalkar commented on SPARK-30460: - [~gsomogyi] Yes I am using S3 for checkpoint and as we know S3 do not support appending object. However, if you look at the exception stack-trace, it seems it is trying to append the object, which causing failure. If you follow the stack trace `FileBasedWriteAheadLogWriter` gets `outputstream` using HDFSUtils. However HDFSUtils, only supports case for HDFS not for the other non append-able system. I don't see it as issue of consistency model but bug in code > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:118
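// [Illustrative sketch, not the actual Spark or EMR code] To make the append
// discussion in the comment above concrete: the write-ahead-log writer obtains
// its output stream through a helper that calls FileSystem.append() on existing
// files, which object-store connectors typically reject with
// UnsupportedOperationException. A guard of the following shape (the helper name
// `openForWrite` is hypothetical) would avoid the crash, at the cost of
// overwriting the existing segment, which is why a WAL on S3 remains problematic
// even with such a fallback.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

def openForWrite(pathStr: String, conf: Configuration): FSDataOutputStream = {
  val path = new Path(pathStr)
  val fs: FileSystem = path.getFileSystem(conf)
  if (fs.exists(path)) {
    try {
      fs.append(path)            // supported on HDFS and the local filesystem
    } catch {
      case _: UnsupportedOperationException =>
        fs.create(path, true)    // object stores: create/overwrite instead of append
    }
  } else {
    fs.create(path)
  }
}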
[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012756#comment-17012756 ] Gabor Somogyi commented on SPARK-30460: --- [~Sachin] even if somebody hunt down this specific issue S3 checkpoint makes streaming jobs dead many other different ways. > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.
[jira] [Comment Edited] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012756#comment-17012756 ] Gabor Somogyi edited comment on SPARK-30460 at 1/10/20 11:40 AM: - [~Sachin] even if somebody hunt down this specific issue S3 checkpoint makes streaming jobs dead many other ways. was (Author: gsomogyi): [~Sachin] even if somebody hunt down this specific issue S3 checkpoint makes streaming jobs dead many other different ways. > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOut
[jira] [Commented] (SPARK-27148) Support CURRENT_TIME and LOCALTIME when ANSI mode enabled
[ https://issues.apache.org/jira/browse/SPARK-27148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012762#comment-17012762 ] Takeshi Yamamuro commented on SPARK-27148: -- Yea, that's ok. > Support CURRENT_TIME and LOCALTIME when ANSI mode enabled > - > > Key: SPARK-27148 > URL: https://issues.apache.org/jira/browse/SPARK-27148 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > > CURRENT_TIME and LOCALTIME should be supported in the ANSI standard; > {code:java} > postgres=# select CURRENT_TIME; > timetz > > 16:45:43.398109+09 > (1 row) > postgres=# select LOCALTIME; > time > > 16:45:48.60969 > (1 row){code} > Before this, we need to support TIME types (java.sql.Time). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012784#comment-17012784 ] Sachin Pasalkar commented on SPARK-30460: - Yes may be or may be not. I was able to run this on my production for 4-6 hours without any other issues for 4-5 times. It always failed with this issue. If this fix the some part of problem we should fix it. I understand spark 3.0 has new committer but as you said it is not deeply tested. Soon I am going to run my Production with this fix in place, I will update ticket around next EOW. If I was able to run system smoothly or not > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazo
[jira] [Comment Edited] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012784#comment-17012784 ] Sachin Pasalkar edited comment on SPARK-30460 at 1/10/20 12:19 PM: --- [~gsomogyi] Yes may be or may be not. I was able to run this on my production for 4-6 hours without any other issues for 4-5 times. It always failed with this issue. If this fix the some part of problem we should fix it. I understand spark 3.0 has new committer but as you said it is not deeply tested. Soon I am going to run my Production with this fix in place, I will update ticket around next EOW. If I was able to run system smoothly or not was (Author: sachin): Yes may be or may be not. I was able to run this on my production for 4-6 hours without any other issues for 4-5 times. It always failed with this issue. If this fix the some part of problem we should fix it. I understand spark 3.0 has new committer but as you said it is not deeply tested. Soon I am going to run my Production with this fix in place, I will update ticket around next EOW. If I was able to run system smoothly or not > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > 
BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$
[jira] [Resolved] (SPARK-30447) Constant propagation nullability issue
[ https://issues.apache.org/jira/browse/SPARK-30447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30447. -- Fix Version/s: 3.0.0 Assignee: Peter Toth Resolution: Fixed Resolved by https://github.com/apache/spark/pull/27119 > Constant propagation nullability issue > -- > > Key: SPARK-30447 > URL: https://issues.apache.org/jira/browse/SPARK-30447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.0.0 > > > There is a bug in constant propagation due to null handling: > SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) returns those rows where c is > null due to 1 + 1 = 1 propagation, but it shouldn't. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
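To spell out the three-valued-logic reasoning behind the report: for c = NULL the original predicate NOT(c = 1 AND c + 1 = 1) evaluates to NOT(NULL AND NULL) = NOT(NULL) = NULL, so the row is filtered out; but if constant propagation uses c = 1 to rewrite c + 1 = 1 into 1 + 1 = 1 (false), the conjunction evaluates to false even for NULL c, NOT(false) is true, and the NULL row is wrongly returned. A sketch of how one might observe this, assuming a SparkSession `spark` with its implicits imported; the table and data are made up:
{code:scala}
import spark.implicits._

// Nullable column: the third row has c = NULL.
val df = Seq(Option(1), Option(2), None: Option[Int]).toDF("c")
df.createOrReplaceTempView("t")

// Correct result: the rows with c = 1 and c = 2.
// With the buggy rewrite the c = NULL row is also returned.
spark.sql("SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1)").show()
{code}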
[jira] [Updated] (SPARK-30476) NullPointerException when Insert data to hive mongo external table by spark-sql
[ https://issues.apache.org/jira/browse/SPARK-30476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiongCheng updated SPARK-30476: --- Summary: NullPointerException when Insert data to hive mongo external table by spark-sql (was: NullPointException when Insert data to hive mongo external table by spark-sql) > NullPointerException when Insert data to hive mongo external table by > spark-sql > --- > > Key: SPARK-30476 > URL: https://issues.apache.org/jira/browse/SPARK-30476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 > Environment: mongo-hadoop: 2.0.2 > spark-version: 2.4.3 > scala-version: 2.11 > hive-version: 1.2.1 > hadoop-version: 2.6.0 >Reporter: XiongCheng >Priority: Major > > I execute the sql,but i got a NPE. > result_data_mongo is a mongodb hive external table. > {code:java} > insert into result_data_mongo > values("15","15","15",15,"15",15,15,15,15,15,15,15,15,15,15); > {code} > NPE detail: > {code:java} > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at > com.mongodb.hadoop.output.MongoOutputCommitter.getTaskAttemptPath(MongoOutputCommitter.java:264) > at > com.mongodb.hadoop.output.MongoRecordWriter.(MongoRecordWriter.java:59) > at > com.mongodb.hadoop.hive.output.HiveMongoOutputFormat$HiveMongoRecordWriter.(HiveMongoOutputFormat.java:80) > at > com.mongodb.hadoop.hive.output.HiveMongoOutputFormat.getHiveRecordWriter(HiveMongoOutputFormat.java:52) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246) > ... 15 more > {code} > I know mongo-hadoop use the incorrect key to get TaskAttemptID,so I modified > the source code of mongo-hadoop to get the correct properties > ('mapreduce.task.id' and 'mapreduce.task.attempt.id'), but I still can't get > the value. 
I found that these parameters are stored in Spark in the > TaskAttemptContext, but the TaskAttemptContext is not passed into > HiveOutputWriter; is this a design flaw? > Here are the two key points. > mongo-hadoop: > [https://github.com/mongodb/mongo-hadoop/blob/cdcd0f15503f2d1c5a1a2e3941711d850d1e427b/hive/src/main/java/com/mongodb/hadoop/hive/output/HiveMongoOutputFormat.java#L80] > spark-hive:[https://github.com/apache/spark/blob/7c7d7f6a878b02ece881266ee538f3e1443aa8c1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala#L103] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30234) ADD FILE can not add folder from Spark-sql
[ https://issues.apache.org/jira/browse/SPARK-30234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30234. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26863 [https://github.com/apache/spark/pull/26863] > ADD FILE can not add folder from Spark-sql > -- > > Key: SPARK-30234 > URL: https://issues.apache.org/jira/browse/SPARK-30234 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > > We cannot add directories using spark-sql CLI. > In SPARK-4687 support was added for directories as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30234) ADD FILE can not add folder from Spark-sql
[ https://issues.apache.org/jira/browse/SPARK-30234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30234: Assignee: Rakesh Raushan > ADD FILE can not add folder from Spark-sql > -- > > Key: SPARK-30234 > URL: https://issues.apache.org/jira/browse/SPARK-30234 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > > We cannot add directories using spark-sql CLI. > In SPARK-4687 support was added for directories as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30485) Remove SQL configs deprecated before v2.4
[ https://issues.apache.org/jira/browse/SPARK-30485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012872#comment-17012872 ] Sean R. Owen commented on SPARK-30485: -- We had previously removed methods and APIs that were deprecated in 2.3 or earlier, so I think this would be consistent. > Remove SQL configs deprecated before v2.4 > - > > Key: SPARK-30485 > URL: https://issues.apache.org/jira/browse/SPARK-30485 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > Remove the following SQL configs: > * spark.sql.variable.substitute.depth > * spark.sql.execution.pandas.respectSessionTimeZone > * spark.sql.parquet.int64AsTimestampMillis > * Maybe spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName > which was deprecated in v2.4 > Recently all deprecated SQL configs were gathered to the deprecatedSQLConfigs > map: > https://github.com/apache/spark/blob/1ffa627ffb93dc1027cb4b72f36ec9b7319f48e4/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2160-L2189 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30480) Pyspark test "test_memory_limit" fails consistently
[ https://issues.apache.org/jira/browse/SPARK-30480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30480: - Fix Version/s: (was: 3.0.0) > Pyspark test "test_memory_limit" fails consistently > --- > > Key: SPARK-30480 > URL: https://issues.apache.org/jira/browse/SPARK-30480 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > I'm seeing consistent pyspark test failures on multiple PRs > ([#26955|https://github.com/apache/spark/pull/26955], > [#26201|https://github.com/apache/spark/pull/26201], > [#27064|https://github.com/apache/spark/pull/27064]), and they all failed > from "test_memory_limit". > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116422/testReport] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116438/testReport] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116429/testReport] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116366/testReport] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-30480) Pyspark test "test_memory_limit" fails consistently
[ https://issues.apache.org/jira/browse/SPARK-30480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-30480: -- Reverted at [https://github.com/apache/spark/commit/d0983af38ffb123fa440bc5fcf3912db7658dd28] > Pyspark test "test_memory_limit" fails consistently > --- > > Key: SPARK-30480 > URL: https://issues.apache.org/jira/browse/SPARK-30480 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > I'm seeing consistent pyspark test failures on multiple PRs > ([#26955|https://github.com/apache/spark/pull/26955], > [#26201|https://github.com/apache/spark/pull/26201], > [#27064|https://github.com/apache/spark/pull/27064]), and they all failed > from "test_memory_limit". > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116422/testReport] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116438/testReport] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116429/testReport] > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116366/testReport] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource
[ https://issues.apache.org/jira/browse/SPARK-30448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-30448. --- Fix Version/s: 3.0.0 Resolution: Fixed > accelerator aware scheduling enforce cores as limiting resource > --- > > Key: SPARK-30448 > URL: https://issues.apache.org/jira/browse/SPARK-30448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > For the first version of accelerator aware scheduling (SPARK-27495), the SPIP > had a condition that we can support dynamic allocation because we were going > to have a strict requirement that we don't waste any resources. This means > that the number of slots each executor has could be calculated from > the number of cores and task cpus just as is done today. > Somewhere along the line of development we relaxed that and only warn when we > are wasting resources. This breaks the dynamic allocation logic if the > limiting resource is no longer the cores. This means we will request fewer > executors than we really need to run everything. > We have to enforce that cores is always the limiting resource, so we should > throw if it's not. > I guess we could only make this a requirement with dynamic allocation on, but > to make the behavior consistent I would say we just require it across the > board. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource
[ https://issues.apache.org/jira/browse/SPARK-30448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-30448: - Assignee: Thomas Graves > accelerator aware scheduling enforce cores as limiting resource > --- > > Key: SPARK-30448 > URL: https://issues.apache.org/jira/browse/SPARK-30448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > For the first version of accelerator aware scheduling (SPARK-27495), the SPIP > had a condition that we can support dynamic allocation because we were going > to have a strict requirement that we don't waste any resources. This means > that the number of slots each executor has could be calculated from > the number of cores and task cpus just as is done today. > Somewhere along the line of development we relaxed that and only warn when we > are wasting resources. This breaks the dynamic allocation logic if the > limiting resource is no longer the cores. This means we will request fewer > executors than we really need to run everything. > We have to enforce that cores is always the limiting resource, so we should > throw if it's not. > I guess we could only make this a requirement with dynamic allocation on, but > to make the behavior consistent I would say we just require it across the > board. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
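To make the slot arithmetic in the description above concrete, here is a rough sketch with made-up numbers; it is not the Spark scheduler code, and both the values and the use of GPUs are assumptions for illustration.
{code:scala}
// Rough sketch of the slot arithmetic; all values are hypothetical.
val executorCores = 4 // spark.executor.cores
val taskCpus      = 1 // spark.task.cpus
val executorGpus  = 1 // spark.executor.resource.gpu.amount
val taskGpus      = 1 // spark.task.resource.gpu.amount

val slotsByCpu = executorCores / taskCpus // 4 tasks fit per executor by CPU
val slotsByGpu = executorGpus / taskGpus  // only 1 task fits per executor by GPU

// Dynamic allocation sizes its executor requests from the CPU-based slot count,
// so when the GPU is the real limiting resource (1 slot < 4 slots) it under-requests
// executors. With these numbers the check below fails, which is exactly the
// situation the issue proposes to reject up front.
require(slotsByCpu <= slotsByGpu,
  s"cores must be the limiting resource (cpu slots $slotsByCpu > gpu slots $slotsByGpu)")
{code}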
[jira] [Resolved] (SPARK-30343) Skip unnecessary checks in RewriteDistinctAggregates
[ https://issues.apache.org/jira/browse/SPARK-30343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30343. -- Fix Version/s: 3.0.0 Assignee: Takeshi Yamamuro Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26997 > Skip unnecessary checks in RewriteDistinctAggregates > > > Key: SPARK-30343 > URL: https://issues.apache.org/jira/browse/SPARK-30343 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path
[ https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012928#comment-17012928 ] Gabor Somogyi commented on SPARK-30460: --- [~Sachin] OK, good luck then :) > Spark checkpoint failing after some run with S3 path > - > > Key: SPARK-30460 > URL: https://issues.apache.org/jira/browse/SPARK-30460 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Sachin Pasalkar >Priority: Major > > We are using EMR with the SQS as source of stream. However it is failing, > after 4-6 hours of run, with below exception. Application shows its running > but stops the processing the messages > {code:java} > 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] > org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1226 cap=1226],1578315850302,Future())) > java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] > org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown > while writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.UnsupportedOperationException > at > com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) >
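The stack trace shows FileSystem.append being rejected by the EMR S3 filesystem, since object stores generally do not support append. A commonly cited mitigation, not taken from this ticket, is to close the write-ahead log file after every write so no append is needed; a sketch of the relevant Spark Streaming settings:
{code:scala}
import org.apache.spark.SparkConf

// Sketch only: close the WAL file after each write (documented Spark Streaming
// settings), trading some throughput for compatibility with append-less stores.
val conf = new SparkConf()
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
{code}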
[jira] [Resolved] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30196. -- Resolution: Fixed > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012938#comment-17012938 ] Takeshi Yamamuro commented on SPARK-30196: -- v1.7.1 will be released at the end of next week: https://github.com/lz4/lz4-java/issues/156#issuecomment-573063299 I'll close this and file a new JIRA for that. > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30486) Bump lz4-java version to 1.7.1
Takeshi Yamamuro created SPARK-30486: Summary: Bump lz4-java version to 1.7.1 Key: SPARK-30486 URL: https://issues.apache.org/jira/browse/SPARK-30486 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro lz4-java v1.7.0 has an issue on older macOS (e.g., v10.12 and v10.13). Since v1.7.1 will be released at the end of next week, we need to upgrade: https://github.com/lz4/lz4-java/issues/156#issuecomment-573063299 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012939#comment-17012939 ] Takeshi Yamamuro commented on SPARK-30196: -- https://issues.apache.org/jira/browse/SPARK-30486 > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30487) Hive MetaException
Rakesh yadav created SPARK-30487: Summary: Hive MetaException Key: SPARK-30487 URL: https://issues.apache.org/jira/browse/SPARK-30487 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.4 Reporter: Rakesh yadav Fix For: 2.3.5 Hi, I am getting the error below: INFO TransactionTableCreation: Exception Occurred - [Ljava.lang.StackTraceElement;@4fd7c296 20/01/10 14:09:07 INFO TransactionTableCreation: Exception Occurred - Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
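The error text itself names a workaround. A minimal sketch of applying it when building the session, assuming the degraded-performance trade-off noted in the message is acceptable (the app name is hypothetical):
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the workaround suggested by the error message: disable metastore
// partition management for file source tables, at the cost of slower partition pruning.
val spark = SparkSession.builder()
  .appName("hive-metaexception-workaround") // hypothetical app name
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .enableHiveSupport()
  .getOrCreate()
{code}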
[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013039#comment-17013039 ] Lars Francke commented on SPARK-30196: -- Excellent, thank you! > Bump lz4-java version to 1.7.0 > -- > > Key: SPARK-30196 > URL: https://issues.apache.org/jira/browse/SPARK-30196 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30468: Assignee: Zhenhua Wang > Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns (to make things even worse, columns > may have long names or comments), the displayed result is really hard to read. > To improve readability, we could print each column in a separate line. Note > that other systems like Hive/MySQL also display in this way. > Also, for data columns, table properties and options, we'd better put the > right parenthesis to the end of the last column/property/option, instead of > occupying a separate line. > As a result, before the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1' > ) > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y' > ) > {noformat} > after the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1') > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y') > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30468. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27147 [https://github.com/apache/spark/pull/27147] > Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Minor > Fix For: 3.0.0 > > > Currently data columns are displayed in one line for show create table > command, when the table has many columns (to make things even worse, columns > may have long names or comments), the displayed result is really hard to read. > To improve readability, we could print each column in a separate line. Note > that other systems like Hive/MySQL also display in this way. > Also, for data columns, table properties and options, we'd better put the > right parenthesis to the end of the last column/property/option, instead of > occupying a separate line. > As a result, before the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1' > ) > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y' > ) > {noformat} > after the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1') > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y') > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26494) 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,
[ https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013093#comment-17013093 ] Jeff Evans commented on SPARK-26494: To be clear, this type represents an instant in time. From [the docs|https://docs.oracle.com/database/121/SUTIL/GUID-CB5D2124-D9AE-4C71-A83D-DFE071FE3542.htm]: {quote}The TIMESTAMP WITH LOCAL TIME ZONE data type is another variant of TIMESTAMP that includes a time zone offset in its value. Data stored in the database is normalized to the database time zone, and time zone displacement is not stored as part of the column data. When the data is retrieved, it is returned in the user's local session time zone. It is specified as follows:{quote} So it's really almost the same as a {{TIMESTAMP}}, just that it does some kind of automatic TZ conversion (converting from the offset given by the client to the DB server's offset automatically). But that conversion is orthogonal to Spark entirely; it should just be treated like a {{TIMESTAMP}}. > 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type > can't be found, > -- > > Key: SPARK-26494 > URL: https://issues.apache.org/jira/browse/SPARK-26494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: kun'qin >Priority: Minor > > Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be > found, > When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE > At this point, the sqlType value of the function getCatalystType in the > JdbcUtils class is -102. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
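Until Spark maps this type itself, a user-side JDBC dialect along the lines sketched below could treat Oracle's vendor type code -102 as a plain timestamp, following the reasoning in the comment above. This is a hypothetical illustration, not the proposed fix in Spark.
{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, TimestampType}

// Hypothetical user-side dialect: map TIMESTAMP WITH LOCAL TIME ZONE (type code -102),
// which the default mapping does not recognize, to Catalyst's TimestampType.
object OracleLocalTzDialect extends JdbcDialect {
  private val TimestampWithLocalTz = -102

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == TimestampWithLocalTz) Some(TimestampType) else None
}

// Registered dialects take precedence over the built-in Oracle dialect for matching URLs.
JdbcDialects.registerDialect(OracleLocalTzDialect)
{code}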
[jira] [Assigned] (SPARK-29779) Compact old event log files and clean up
[ https://issues.apache.org/jira/browse/SPARK-29779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-29779: -- Assignee: Jungtaek Lim > Compact old event log files and clean up > > > Key: SPARK-29779 > URL: https://issues.apache.org/jira/browse/SPARK-29779 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > This issue is to track the effort on compacting old event logs (and cleaning > up after compaction) without breaking the compatibility guarantee. > Please note that this issue leaves the functionalities below for future JIRA > issues, as the patch for SPARK-29779 was too large and we decided to break it down. > * apply filter in SQL events > * integrate compaction into FsHistoryProvider > * documentation about new configuration -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29779) Compact old event log files and clean up
[ https://issues.apache.org/jira/browse/SPARK-29779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29779. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27085 [https://github.com/apache/spark/pull/27085] > Compact old event log files and clean up > > > Key: SPARK-29779 > URL: https://issues.apache.org/jira/browse/SPARK-29779 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > This issue is to track the effort on compacting old event logs (and cleaning > up after compaction) without breaking the compatibility guarantee. > Please note that this issue leaves the functionalities below for future JIRA > issues, as the patch for SPARK-29779 was too large and we decided to break it down. > * apply filter in SQL events > * integrate compaction into FsHistoryProvider > * documentation about new configuration -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013143#comment-17013143 ] Shane Knapp commented on SPARK-29988: - got it, i'll get those sorted later today. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > Now now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013145#comment-17013145 ] Dongjoon Hyun commented on SPARK-29988: --- Thank you! > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > Now now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins > manually. (This should be added to SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is for preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30447) Constant propagation nullability issue
[ https://issues.apache.org/jira/browse/SPARK-30447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30447: -- Affects Version/s: 2.4.4 > Constant propagation nullability issue > -- > > Key: SPARK-30447 > URL: https://issues.apache.org/jira/browse/SPARK-30447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.0.0 > > > There is a bug in constant propagation due to null handling: > SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) returns those rows where c is > null due to 1 + 1 = 1 propagation, but it shouldn't. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30447) Constant propagation nullability issue
[ https://issues.apache.org/jira/browse/SPARK-30447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30447: -- Fix Version/s: 2.4.5 > Constant propagation nullability issue > -- > > Key: SPARK-30447 > URL: https://issues.apache.org/jira/browse/SPARK-30447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > There is a bug in constant propagation due to null handling: > SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) returns those rows where c is > null due to 1 + 1 = 1 propagation, but it shouldn't. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30447) Constant propagation nullability issue
[ https://issues.apache.org/jira/browse/SPARK-30447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013148#comment-17013148 ] Dongjoon Hyun commented on SPARK-30447: --- Hi, [~petertoth]. Could you check the old Spark versions (2.3.4/2.2.3) and update `Affected Versions` please? > Constant propagation nullability issue > -- > > Key: SPARK-30447 > URL: https://issues.apache.org/jira/browse/SPARK-30447 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > There is a bug in constant propagation due to null handling: > SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) returns those rows where c is > null due to 1 + 1 = 1 propagation, but it shouldn't. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
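A minimal reproduction sketch of the predicate in the description, runnable in spark-shell (the usual `spark` session is assumed):
{code:scala}
// Correct result: the rows with c = 1 and c = 2; the row with c = NULL must be filtered
// out because NOT(NULL AND NULL) evaluates to NULL. With the buggy rewrite,
// `c + 1 = 1` becomes `1 + 1 = 1`, i.e. FALSE, the predicate collapses to
// NOT(c = 1 AND FALSE) = TRUE, and the NULL row is wrongly returned.
spark.sql(
  """SELECT * FROM VALUES (1), (2), (CAST(NULL AS INT)) AS t(c)
    |WHERE NOT(c = 1 AND c + 1 = 1)""".stripMargin).show()
{code}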
[jira] [Updated] (SPARK-30312) Preserve path permission when truncate table
[ https://issues.apache.org/jira/browse/SPARK-30312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30312: -- Affects Version/s: 2.0.2 2.1.3 2.2.3 2.3.4 2.4.4 > Preserve path permission when truncate table > > > Key: SPARK-30312 > URL: https://issues.apache.org/jira/browse/SPARK-30312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > When Spark SQL truncates table, it deletes the paths of table/partitions, > then re-create new ones. If custom permission/acls are set on the paths, the > metadata will be deleted. > We should preserve original permission/acls if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30312) Preserve path permission when truncate table
[ https://issues.apache.org/jira/browse/SPARK-30312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30312: -- Issue Type: Bug (was: Improvement) > Preserve path permission when truncate table > > > Key: SPARK-30312 > URL: https://issues.apache.org/jira/browse/SPARK-30312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > When Spark SQL truncates table, it deletes the paths of table/partitions, > then re-create new ones. If custom permission/acls are set on the paths, the > metadata will be deleted. > We should preserve original permission/acls if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30312) Preserve path permission when truncate table
[ https://issues.apache.org/jira/browse/SPARK-30312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30312. --- Fix Version/s: 3.0.0 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26956 > Preserve path permission when truncate table > > > Key: SPARK-30312 > URL: https://issues.apache.org/jira/browse/SPARK-30312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > When Spark SQL truncates table, it deletes the paths of table/partitions, > then re-create new ones. If custom permission/acls are set on the paths, the > metadata will be deleted. > We should preserve original permission/acls if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
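A rough sketch of the idea, assuming plain Hadoop FileSystem permissions; ACLs could be captured and re-applied in the same way via getAclStatus/setAcl. This is illustrative only, not the merged Spark change:
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Remember the directory's permission before the delete/recreate cycle that
// TRUNCATE performs, then re-apply it to the new directory.
def truncatePreservingPermission(fs: FileSystem, dir: Path): Unit = {
  val savedPermission =
    if (fs.exists(dir)) Some(fs.getFileStatus(dir).getPermission) else None
  fs.delete(dir, true) // recursive delete, as truncate does today
  fs.mkdirs(dir)
  savedPermission.foreach(p => fs.setPermission(dir, p))
}

// Usage sketch with an assumed Hadoop configuration and path:
// val fs = FileSystem.get(new Configuration())
// truncatePreservingPermission(fs, new Path("/warehouse/db.db/tbl/part=1"))
{code}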
[jira] [Updated] (SPARK-29174) LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source
[ https://issues.apache.org/jira/browse/SPARK-29174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29174: -- Issue Type: Improvement (was: Bug) > LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source > --- > > Key: SPARK-29174 > URL: https://issues.apache.org/jira/browse/SPARK-29174 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > *using does not work for insert overwrite when in local but works when > insert overwrite in HDFS directory* > {code} > 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory > '/user/trash2/' using parquet select * from trash1 a where a.country='PAK'; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.448 seconds) > 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory > '/opt/trash2/' using parquet select * from trash1 a where a.country='PAK'; > Error: org.apache.spark.sql.catalyst.parser.ParseException: > LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, > pos 0) > > == SQL == > insert overwrite local directory '/opt/trash2/' using parquet select * from > trash1 a where a.country='PAK' > ^^^ (state=,code=0) > 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory > '/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK'; > +-+--+ > | Result | > +-+--+ > | | | > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26494) Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type
[ https://issues.apache.org/jira/browse/SPARK-26494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Evans updated SPARK-26494: --- Summary: Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type (was: 【spark sql】Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be found,) > Support Oracle TIMESTAMP WITH LOCAL TIME ZONE type > -- > > Key: SPARK-26494 > URL: https://issues.apache.org/jira/browse/SPARK-26494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: kun'qin >Priority: Minor > > Use spark to read oracle TIMESTAMP(6) WITH LOCAL TIME ZONE type can't be > found, > When the data type is TIMESTAMP(6) WITH LOCAL TIME ZONE > At this point, the sqlType value of the function getCatalystType in the > JdbcUtils class is -102. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30488) Deadlock between block-manager-slave-async-thread-pool and spark context cleaner
Rohit Agrawal created SPARK-30488: - Summary: Deadlock between block-manager-slave-async-thread-pool and spark context cleaner Key: SPARK-30488 URL: https://issues.apache.org/jira/browse/SPARK-30488 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Reporter: Rohit Agrawal Deadlock happens while cleaning up the spark context. Here is the full thread dump: at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:121) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) "Spark Context Cleaner": at java.lang.ClassLoader.checkCerts(ClassLoader.java:887) - waiting to lock <0xca33e4c8> (a sbt.internal.ManagedClassLoader$ZombieClassLoader) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668) at java.lang.ClassLoader.defineClass(ClassLoader.java:761) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at sbt.internal.ManagedClassLoader$ZombieClassLoader.lookupClass(LayeredClassLoaders.scala:336) at sbt.internal.ManagedClassLoader.findClass(LayeredClassLoaders.scala:375) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) - locked <0xc1f359f0> (a sbt.internal.LayeredClassLoader) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:138) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:226) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:192) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185) at scala.Option.foreach(Option.scala:257) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185) - locked <0xc3d74cd0> (a org.apache.spark.ContextCleaner) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178) at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73) "block-manager-slave-async-thread-pool-81": at java.lang.ClassLoader.loadClass(ClassLoader.java:404) - waiting to lock <0xc1f359f0> (a sbt.internal.LayeredClassLoader) at java.lang.ClassLoader.loadClass(ClassLoader.java:411) - locked <0xca33e4c8> (a sbt.internal.ManagedClassLoader$ZombieClassLoader) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:58) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57) at org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:86) at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Found 1 deadlock. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-29748. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26496 [https://github.com/apache/spark/pull/26496] > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-29748: Assignee: Bryan Cutler > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-22232. -- Resolution: Won't Fix Closing in favor for fix in SPARK-29748 > Row objects in pyspark created using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Priority: Major > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:none} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > # Putting fields in alphabetical order masks the issue > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24915) Calling SparkSession.createDataFrame with schema can throw exception
[ https://issues.apache.org/jira/browse/SPARK-24915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24915. -- Resolution: Won't Fix Closing in favor of fix in SPARK-29748 > Calling SparkSession.createDataFrame with schema can throw exception > > > Key: SPARK-24915 > URL: https://issues.apache.org/jira/browse/SPARK-24915 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Python 3.6.3 > PySpark 2.3.1 (installed via pip) > OSX 10.12.6 >Reporter: Stephen Spencer >Priority: Major > > There seems to be a bug in PySpark when using the PySparkSQL session to > create a dataframe with a pre-defined schema. > Code to reproduce the error: > {code:java} > from pyspark import SparkConf, SparkContext > from pyspark.sql import SparkSession > from pyspark.sql.types import StructType, StructField, StringType, Row > conf = SparkConf().setMaster("local").setAppName("repro") > context = SparkContext(conf=conf) > session = SparkSession(context) > # Construct schema (the order of fields is important) > schema = StructType([ > StructField('field2', StructType([StructField('sub_field', StringType(), > False)]), False), > StructField('field1', StringType(), False), > ]) > # Create data to populate data frame > data = [ > Row(field1="Hello", field2=Row(sub_field='world')) > ] > # Attempt to create the data frame supplying the schema > # this will throw a ValueError > df = session.createDataFrame(data, schema=schema) > df.show(){code} > Running this throws a ValueError > {noformat} > Traceback (most recent call last): > File "schema_bug.py", line 18, in > df = session.createDataFrame(data, schema=schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 691, in createDataFrame > rdd, schema = self._createFromLocal(map(prepare, data), schema) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in _createFromLocal > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/session.py", > line 423, in > data = [schema.toInternal(row) for row in data] > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in toInternal > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 601, in > for f, v, c in zip(self.fields, obj, self._needConversion)) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 439, in toInternal > return self.dataType.toInternal(obj) > File > "/Users/stephenspencer/benevolent/ai/neat/rex/.env/lib/python3.6/site-packages/pyspark/sql/types.py", > line 619, in toInternal > raise ValueError("Unexpected tuple %r with StructType" % obj) > ValueError: Unexpected tuple 'Hello' with StructType{noformat} > The problem seems to be here: > https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/python/pyspark/sql/types.py#L603 > specifically the bit > {code:java} > zip(self.fields, obj, self._needConversion) > {code} > This zip statement seems to assume that obj and self.fields are ordered in > the same way, so that the elements of obj will correspond to the right fields > in the schema. 
However, this is not true: a Row orders its elements > alphabetically but the fields in the schema are in whatever order they are > specified. In this example field2 is being initialised with the field1 > element 'Hello'. If you re-order the fields in the schema to go (field1, > field2), the given example works without error. > The schema in the repro is specifically designed to elicit the problem: the > fields are out of alphabetical order and one field is a StructType, making > schema._needSerializeAnyField == True. However, we encountered this in real use. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation
[ https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-29748: --- Labels: release-notes (was: ) > Remove sorting of fields in PySpark SQL Row creation > > > Key: SPARK-29748 > URL: https://issues.apache.org/jira/browse/SPARK-29748 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > Currently, when a PySpark Row is created with keyword arguments, the fields > are sorted alphabetically. This has created a lot of confusion with users > because it is not obvious (although it is stated in the pydocs) that they > will be sorted alphabetically, and then an error can occur later when > applying a schema and the field order does not match. > The original reason for sorting fields is because kwargs in python < 3.6 are > not guaranteed to be in the same order that they were entered. Sorting > alphabetically would ensure a consistent order. Matters are further > complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to > to be referenced by name when made by kwargs, but this flag is not serialized > with the Row and leads to inconsistent behavior. > This JIRA proposes that any sorting of the Fields is removed. Users with > Python 3.6+ creating Rows with kwargs can continue to do so since Python will > ensure the order is the same as entered. Users with Python < 3.6 will have to > create Rows with an OrderedDict or by using the Row class as a factory > (explained in the pydoc). If kwargs are used, an error will be raised or > based on a conf setting it can fall back to a LegacyRow that will sort the > fields as before. This LegacyRow will be immediately deprecated and removed > once support for Python < 3.6 is dropped. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30489) Make build delete pyspark.zip file properly
Jeff Evans created SPARK-30489: -- Summary: Make build delete pyspark.zip file properly Key: SPARK-30489 URL: https://issues.apache.org/jira/browse/SPARK-30489 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Jeff Evans The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file within {{python/lib}}. The only problem is the Ant task definition for the delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Affects Version/s: (was: 2.3.4) > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jeff Evans >Priority: Trivial > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Issue Type: Bug (was: Improvement) > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Jeff Evans >Priority: Trivial > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30489. --- Fix Version/s: 3.0.0 2.4.5 Assignee: Jeff Evans Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/27171 > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jeff Evans >Assignee: Jeff Evans >Priority: Trivial > Fix For: 2.4.5, 3.0.0 > > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Affects Version/s: 2.3.4 2.4.4 > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Jeff Evans >Priority: Trivial > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Affects Version/s: 2.3.4 > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Jeff Evans >Assignee: Jeff Evans >Priority: Trivial > Fix For: 2.4.5, 3.0.0 > > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Affects Version/s: 2.0.2 > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Jeff Evans >Assignee: Jeff Evans >Priority: Trivial > Fix For: 2.4.5, 3.0.0 > > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Affects Version/s: 2.1.3 > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Jeff Evans >Assignee: Jeff Evans >Priority: Trivial > Fix For: 2.4.5, 3.0.0 > > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30489) Make build delete pyspark.zip file properly
[ https://issues.apache.org/jira/browse/SPARK-30489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30489: -- Affects Version/s: 2.2.3 > Make build delete pyspark.zip file properly > --- > > Key: SPARK-30489 > URL: https://issues.apache.org/jira/browse/SPARK-30489 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Jeff Evans >Assignee: Jeff Evans >Priority: Trivial > Fix For: 2.4.5, 3.0.0 > > > The build uses Ant tasks to delete, then recreate, the {{pyspark.zip}} file > within {{python/lib}}. The only problem is the Ant task definition for the > delete operation is incorrect (it uses {{dir}} instead of {{file}}), so it > doesn't actually get deleted by this task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org